Abstract

Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume
we are allowed to say 'predict 0' or 'predict 1', and our payoff is +1 if the prediction is correct and
$-1$ otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong
predictions minus the number of right predictions so far. In this paper we are interested in algorithms that
have essentially zero (expected) loss over any string at any point in time and yet have small regret with
respect to always predicting 0 or always predicting 1. This can be formulated as the experts problem in
which we require exponentially small regret with respect to one special expert (which corresponds to the
'no prediction' strategy in our setting). It was shown by Even-Dar et al. (COLT'07) that constant expected
loss can be achieved. In this paper we give an algorithm that has small regret and exponentially small loss
(in expectation), achieving the optimal tradeoff between the two. For a sequence of length $T$ our algorithm
has an amortized per time step regret $\epsilon$ and loss $e^{-\epsilon^2 T+1}/\sqrt{T}$ in expectation for all strings. The algorithm
extends to the general expert setting, yielding essentially zero loss with respect to the special expert and
optimal loss/regret tradeoff.

We show that the strong loss bounds of the algorithm have some surprising consequences. First, we 
obtain a parameter free algorithm for the experts problem that has optimal adaptive regret bounds, i.e. 
bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any 
window of size $n$ the regret of our algorithm to any expert never exceeds $O(\sqrt{n(\log N + \log T)})$, where
$N$ is the number of experts and $T$ is the time horizon. For example, it implies that an adversarial real-valued
signal can be predicted in such a way that the average difference between the prediction and the
signal is 'noise-like' at every time scale. To the best of our knowledge, such guarantees are not provided 
by any known algorithm for the experts problem. We also show applications of our framework to the 
multi-armed bandit problem with partial information and online convex optimization. 
1 Introduction



In this paper we revisit the multi-armed bandit problem with full information, or the experts problem, a 
well-studied variant of online optimization. In this problem the decision maker iteratively chooses among 
N available alternatives without knowledge of their costs, and gets payoff based on the chosen alternative. 
The costs of all alternatives are revealed after the decision is made. This process is repeated over T rounds, 
and the goal of the decision maker is to maximize her cumulative payoff over all time steps t = 1, . . . , T. 
This problem and its variations have been studied extensively, and efficient algorithms have been obtained
(e.g. [9, 25, 10, 4]). The most widely used measure of performance of an online decision making algorithm is
regret, which is defined as the difference between the payoff of the best fixed alternative and the payoff of the
algorithm. The well-known weighted majority algorithm of [25] obtains regret $O(\sqrt{T \log N})$ even when no
assumptions are made on the process generating the payoff. Regret to the best fixed alternative in hindsight is
a very natural notion when the payoffs are sampled from an unknown distribution, and in fact such scenarios
show that the bound of $O(\sqrt{T \log N})$ on regret achieved by the weighted majority algorithm is optimal.

When the sequence of payoffs is adversarial, other approaches to measuring the performance of an online 
algorithm may be valuable, and several such approaches have been proposed and analyzed in the literature. 
In this paper we generalize two such approaches and give significant improvements on existing algorithms. 

The starting point of our development is the observation that standard online decision making algorithms
(e.g. weighted majority) provide uniform regret guarantees in the sense that the regret of the algorithm
to any alternative does not depend on the alternative itself: regret to any alternative is guaranteed to be
$O(\sqrt{T \log N})$. In some applications, however, it may be useful to allow for larger regret with respect to some
alternatives at the expense of having small regret to other 'special' alternatives. As a motivating example,
consider the following bit prediction game, where we will have only one special alternative. The decision
maker is presented with a sequence of bits and is trying to predict the next bit at each time step. Every time
she is allowed to say 'predict 0', 'predict 1' or 'no prediction'. One could think of the bits as indications of
whether a stock price goes up or down on a given day, if one assumes that the change in the price of a stock
is always ±1. If the decision maker predicts 1 (i.e. that the stock will go up), she buys one stock to sell
it the next day, and short sells one stock if her prediction is 0. Thus, a correct prediction gives payoff +1 and
an incorrect prediction gives payoff $-1$ (this is, of course, a very simplified model of the stock market). We
will say that for a sequence of bits her loss at any point in time is the (expected) number of wrong predictions
minus the number of right predictions so far. Note that in this setting the strategy of not predicting is special
in that it does not lead to any loss. Thus, the following question arises: does there exist an algorithm that has
zero expected regret to the 'no prediction' strategy, i.e. zero expected loss, and yet has small regret to the other
two strategies ('predict 0' and 'predict 1')? What tradeoffs between these parameters can be achieved? This
question was asked in the literature before. Even-Dar et al. [12] gave an algorithm that has constant regret to
any fixed distribution on the experts at the expense of regret $O(\sqrt{T \log N}(\log T + \log\log N))$ with respect to
all other experts¹. In this paper we improve upon the results of [12] by showing that while zero expected loss
is impossible, for any $\epsilon > 1/\sqrt{T}$ there exists an algorithm that achieves regret at most $4\epsilon$ and expected loss
at most $e^{-\epsilon^2 T+1}/\sqrt{T}$. Thus, the loss is exponentially small in the length of the sequence. This yields regret
$O(\sqrt{T(\log N + \log T)})$ to the best and $O((NT)^{-2})$ to the average in the setting of [12]. We also provide a
lower bound that shows that this exponential tradeoff is optimal up to constant factors. The result extends to
the general case of $N$ experts, yielding an optimal regret/loss tradeoff.

An important property of our algorithm is that it does not need a high imbalance between the number of
ones and the number of zeros in the whole sequence to have a gain: it is sufficient for the imbalance to be
large enough in at least one contiguous time window², the size of which is a parameter of the algorithm.

¹ In fact, [12] provide several algorithms, of which the most relevant for comparison are Phased Aggression, yielding
$O(\sqrt{T \log N}(\log T + \log\log N))$ regret to the best, and D-Prod, yielding $O(\sqrt{T/\log N}\,\log T)$ regret to the best. For the bit
prediction problem one would set $N = 2$ and use the uniform distribution over the 'predict 0' and 'predict 1' strategies as the special
distribution. Our algorithm improves on both of them, yielding an optimal tradeoff.

The bit prediction problem also motivates the question of whether or not strong non-uniform bounds can 
be obtained, which we now describe. In particular, instead of requiring the algorithm to be competitive only 
to the 'predict 0', 'predict 1' and 'no prediction' strategies, one may want to have regret guarantees to more 
complex prediction strategies. Moreover, it seems natural to require better regret bounds with respect to 
'simple' strategies (an extreme of this is the exponentially small regret against the 'no prediction' strategy 
that we obtain). One measure of complexity of a prediction strategy is its Kolmogorov complexity, which we
denote by $\mathrm{comp}(S)$. Surprisingly, the strong exponential tradeoff between regret and loss allows us to obtain
an algorithm that has regret at most $O(\sqrt{T\,\mathrm{comp}(S)\log T})$ for any $S$ and loss at most $O(T^{-2})$, for example. It
is perhaps interesting to note that it is impossible to get regret $o(\mathrm{comp}(S)/\log T)$, i.e. our algorithm is within
an $O(\log^2 T)$ factor of optimum when $\mathrm{comp}(S) = \Theta(T/\log T)$. It should be noted that such non-uniform
guarantees (without the small loss property) can be obtained directly from the weighted majority algorithm.

While regret to the best fixed expert in hindsight is a very natural measure of performance when the 
payoffs are sampled from fixed unknown distributions, a more stringent measure of performance is desirable 
in a potentially changing adversarial environment. Several results providing more refined characterizations 
have been obtained in the literature. A line of work on 'tracking the best expert' (e.g., [19, 6]) proves regret
bounds against the $k$-shifting optimum, i.e. the best solution that is forced to be fixed on each of $k + 1$
intervals covering $1, \ldots, T$, under the assumption that the loss function is exp-concave. Recently, [18] gave
an algorithm that has adaptive regret, i.e. regret against the best expert in any subinterval of $[1..T]$, at most
$O(\sqrt{T \log T})$, which also implies regret bounds against the $k$-shifting optimum. We improve significantly on
these results by showing that for any partition of the interval $[1 : T]$ into disjoint intervals $I_j$, $j = 1, \ldots, k$,
the regret of our algorithm to the optimum that may use a different expert in each interval $I_j$ is bounded by

[bound not recoverable from the source]    (1)

which is optimal up to logarithmic factors. Thus, our algorithm has almost optimal regret bounds with respect
to $k$-shifting optima for any $k$.

The guarantees above bound the total regret of the algorithm in terms of the regret over different experts
assigned to elements of the partition and do not necessarily say anything about the regret in a specific
interval of time, as the result of [18] does. We show that our algorithm has another very strong property that
generalizes [18]. For any sequence of real numbers $r_t$ we say that $r$ is $Z$-uniform at scale $n$ if

$$\sum_{j=1}^{t} (1 - 1/n)^{t-j}\, r_j \le O\left(\sqrt{n \log(1/Z)}\right)$$

for all $t$. Intuitively, a sequence is $Z$-uniform at scale $n$ if its mean in any geometric window of size $n$ does not
exceed the variance of a sum of independent Ber(±1, 1/2) random variables over that interval. In particular,
if $r_j$ are Ber(±1, 1/2), then $r$ is $Z$-uniform at scale $n$ with probability at least $1 - ZT$. Denote the payoff of
our algorithm at time $t$ by $s_{*,t}$ and denote the payoff of expert $i$ at time $t$ by $s_{i,t}$. Our algorithm guarantees
that the sequence $s_{i,t} - s_{*,t}$ is $Z$-uniform at any scale $n > 40 \log(NT)$ for any $Z < (NT)^{-2}$. Thus, the regret
of our algorithm to any expert in any window of length $n$ is bounded by $O(\sqrt{n(\log N + \log T)})$. Note that
this property does not follow from (1). In particular, while (1) provides guarantees against a decomposition
into disjoint intervals, the $Z$-uniform property involves guarantees on payoff on any geometric window. One
interesting consequence of the $Z$-uniform property is the following: given an adversarial real-valued signal $x_t$
one can output a prediction $\hat{x}_t$ that is within $O(\sqrt{n \log(1/Z)})$ of the performance of the best fixed predictor
in any window of length $n$, i.e. the prediction error exhibits properties of random noise! To the best of our
knowledge, no known algorithm possesses this property.

² More precisely, we use an infinite window with geometrically decreasing weighting, so that most of the weight is contained in
the window of size $O(n)$, where $n$ is a parameter of the algorithm.

We also show how risk-free assets can be constructed using our algorithm under the assumption of 
bounded change of price of a stock. This application motivates studying the effect of transaction costs on 
the algorithm. It turns out that one can still get bounded loss at the expense of making regret commensu- 
rate with the transaction cost. This involves treating the confidence values of the algorithm as probabilities 
of selling or buying. We derive the corresponding high probability bounds on the loss of the algorithm in
section 6.

Finally, we show that our techniques can be applied to the multi-armed bandit problem with partial information
(see, e.g. 0]), giving an algorithm with $O(N^{1/3} T^{2/3} \log^{1/3}(NT))$ regret and loss $O((NT)^{-2})$ with
respect to the average of all arms. Additionally, we show how our framework can be applied to the online
convex optimization algorithm of Zinkevich [28] to obtain an algorithm with good adaptive regret guarantees
(see section U}- 

1.1 Related work 

In the general online decision problem the decision maker has to choose a decision from a set of available 
alternatives at each point in time t = 1, . . . , T without knowing future payoffs of the available alternatives. 
At each time step t the payoffs of alternatives at time t are revealed to the decision maker after she commits 
to a choice. Online decision problems have been studied under different feedback models and assumptions on 
the process generating the payoffs. In the transparent feedback, or full information, model the costs of all available
alternatives are revealed, while in the opaque feedback, or partial information, model only the cost of the
decision that was made is revealed to the algorithm. The performance of an online decision making algorithm 
is usually measured in terms of regret, i.e. the difference between the payoff of the algorithm and the payoff 
of the best fixed alternative in hindsight. 

Various assumptions on the process generating the payoff of arms have been considered in the literature. 
When a prior belief on the distribution of payoffs is assumed, the discounted reward with infinite time horizon 
can be efficiently maximized using the Gittins index (see, e.g. Ifl31l26l0 . Low -regret algorithms for the setting 
when the payoffs come from an unknown probability distribution were obtained in [3, 2, 24]. Assumptions
on the payoff sequence are not necessary to achieve low regret. In particular, the well-known weighted
majority algorithm [25] yields $O(\sqrt{T \log N})$ regret in the full information model (also known as the experts
problem). Surprisingly, [4] showed that low regret with respect to the best arm in hindsight can be achieved
without making any assumptions on the payoff sequence even in the partial information model, giving the first
algorithm with $O(\sqrt{NT \log N})$ regret in this setting. Better bounds have been obtained under the assumption
that the sequence of payoffs has low variance (e.g. [16]). A related line of work applying similar techniques
to problems in finance includes [10, 20, 17].

More specialized techniques have been developed for the online optimization problem in both the full 
and partial information models ([21, 11, 5]). Better bounds can be obtained under the convexity assumption
C ll28l [T3l [TTl ITTl ^ . Another line of work focuses on obtaining good regret guarantees when the space of available 
alternatives is very large or possibly infinite, but has some special structure (e.g. forms a metric space) - 
[23, 22]. It is hard to faithfully represent the large body of work on online decision problems in limited space,
and we refer the reader to Q for a detailed exposition. 

Other measures of performance of an online algorithm have been considered in the literature. The question
of what tradeoff can be achieved if one would like to have a significantly better guarantee
with respect to a fixed arm or a distribution on arms was asked before in [12], as we discussed in the introduction.
Besides improving on the result of [12], we also answer the question left open by the authors: 'It
is currently unknown whether or not it is possible to strengthen Theorem 6 to say that any algorithm with
regret $O(\sqrt{T \log T})$ to the best expert must have regret $\Omega(T^{\epsilon})$ to the average for some constant $\epsilon > 0$'. In
fact, for any $\gamma > 0$ our algorithm has loss $O(T^{-\gamma})$ (corresponding to regret to the average) when the regret
is $O(\sqrt{T \log T})$, thus showing that such a strengthening is impossible. Tradeoffs between regret and loss
were also examined in [27], where the author studied the set of values of $a, b$ for which an algorithm can have
payoff $a\,\mathrm{OPT} + b \log N$, where OPT is the payoff of the best arm and $a, b$ are constants. The problem of bit
prediction was also considered in [14], where several loss functions are considered. None of them, however,
corresponds to our setting, making the results incomparable.

In recent work on the NormalHedge algorithm [8] the authors use a potential function which is very similar
to our function $g(x)$ (defined below), getting strong regret guarantees to the $\epsilon$-quantile of best experts.
However, the use of the function $g(x)$ seems to be quite different from ours, as is the focus of the paper [8].



1.2 Our results 

We now give a formal statement of our results. In section 1.3 we state our results on predicting binary sequences
with low regret to the 'predict 0' and 'predict 1' strategies and exponentially small loss. In section 1.4
we show how results of section 1.3 extend to predicting with low regret to any set of $N$ prediction strategies
and exponentially small loss. Note that this is equivalent to the experts problem in the full information model.
Finally, in section 1.5 we give results on adaptive regret of our algorithm, i.e. the $Z$-uniform property of the
payoff sequence.

We start by defining the bit prediction problem formally. Let $b_t$, $t = 1, \ldots, T$ be an adversarial sequence
of real numbers, $-1 \le b_t \le 1$. At each time step $t = 1, \ldots, T$ the algorithm is required to output a confidence
level $f_t$, and then the value of $b_t$ is revealed to it. The payoff of the algorithm is $G = \sum_{t=1}^{T} f_t b_t$. For example,
if $b_t \in \{-1, +1\}$, then this setup is analogous to a betting process in which a player observes a sequence of
bits and at each point in time bets an $|f_t|$ amount of money on the value of the next bit being $\mathrm{sign}(f_t)$. Betting
$f_t = 0$ amounts to not playing the game, and incurs no loss, while not bringing any profit. We use the term loss
of an algorithm $A$ to denote $\max_b(-A(b))$, where $A(b)$ is the payoff of $A$ on the sequence $b$. It is easy to see that any algorithm that has a positive expected
gain on some sequence necessarily loses on another sequence. Thus, we will be concerned with finding a
betting strategy that has a bounded loss but also has low regret against a number of predefined strategies. We
now consider the bit prediction game and design an algorithm that has low regret against two basic strategies:
$S_+$, which always bets +1, and $S_-$, which always bets $-1$. We denote the base random strategy by $S_0$. In
what follows we will use the notation gain to denote the average (per timestep) gain of an algorithm.
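The betting game defined above can be illustrated with a minimal sketch. The helper name `payoff` and the example bit sequence are illustrative assumptions, not part of the paper; the code only evaluates $G = \sum_t f_t b_t$ for the two basic strategies and for the strategy that never bets.

```python
from typing import Sequence


def payoff(bets: Sequence[float], bits: Sequence[float]) -> float:
    """Total payoff G = sum_t f_t * b_t of a sequence of signed bets."""
    return sum(f * b for f, b in zip(bets, bits))


# Hypothetical input: six +1 bits and two -1 bits.
bits = [+1, +1, -1, +1, -1, +1, +1, +1]
T = len(bits)

always_plus = [+1.0] * T   # S_+ : always bets +1
always_minus = [-1.0] * T  # S_- : always bets -1
no_bet = [0.0] * T         # f_t = 0: never plays, so never loses

print(payoff(always_plus, bits))   # 4.0
print(payoff(always_minus, bits))  # -4.0
print(payoff(no_bet, bits))        # 0.0
```

On this input $S_+$ gains exactly what $S_-$ loses, which is why an algorithm competing with both must also be guarded against losses.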

1.3 Prediction with exponentially small loss 

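Our main result is

Theorem 1 For any $\epsilon > \sqrt{1/T}$ there exists an algorithm for which

$$\mathrm{gain} \ge \max\left\{ \frac{1}{T}\left|\sum_{j=1}^{T} b_j\right| - 4\epsilon,\ 0 \right\} - e^{-\epsilon^2 T+1}/\sqrt{T},$$

i.e. the algorithm has at most $4\epsilon$ regret against $S_+$ and $S_-$ as well as an exponentially small loss.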
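To get a feel for the tradeoff, the following sketch tabulates the two quantities for a few values of $\epsilon$. It assumes the bounds read as stated in the introduction (regret $4\epsilon$, loss $e^{-\epsilon^2 T+1}/\sqrt{T}$); the function names and the chosen values of $T$ and $\epsilon$ are illustrative.

```python
import math


def regret_bound(eps: float) -> float:
    """Per-step regret bound 4*eps against S_+ and S_- (assumed form)."""
    return 4.0 * eps


def loss_bound(eps: float, T: int) -> float:
    """Loss bound exp(-eps^2 * T + 1) / sqrt(T) (assumed form)."""
    return math.exp(-eps * eps * T + 1.0) / math.sqrt(T)


T = 10_000  # 1/sqrt(T) = 0.01, so every eps below exceeds the threshold
for eps in (0.02, 0.05, 0.1):
    print(f"eps={eps}: regret <= {regret_bound(eps):.2f}, "
          f"loss <= {loss_bound(eps, T):.3e}")
```

Increasing $\epsilon$ by a constant factor raises the regret bound linearly while the loss bound falls exponentially in $\epsilon^2 T$.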
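Remark 2 Note that by setting $\log(1/Z) = \epsilon^2 T$, we get an algorithm with

$$\mathrm{gain} \ge \max\left\{ \frac{1}{T}\left|\sum_{j=1}^{T} b_j\right| - 4\,T^{-1/2}\sqrt{\log(1/Z)},\ 0 \right\} - O\left(Z/\sqrt{T}\right).$$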



The tradeoff between the loss and regret is optimal up to constant factors in the regret term: 

Theorem 3 Any algorithm that has average regret at most $c\sqrt{\ln(1/Z)}\,T^{-1/2}$ incurs loss at least $Z\sqrt{T \log(1/Z)}$
on at least one sequence for some constant $c > 0$.
1.4 Combining strategies

Our techniques can be used to obtain an algorithm that is competitive against a number of strategies, while 
maintaining bounded loss. Here and below we use the term strategies since our motivation comes from the 
bit prediction problem. Strategies are in fact equivalent to experts in this setting, i.e. our results are for the 
experts problem in the full information model. 

We prove Theorem 4, which gives a family of algorithms with small loss and prescribed regret bounds to a
set of prediction strategies $S_1, \ldots, S_N$. The parameters of the algorithm can be chosen to get desired (possibly
different) regret bounds to $S_1, \ldots, S_N$. We use specific settings of parameters to get results in Remark 5,
Theorem 7 and Theorem 12 below.

Theorem 4 Let $S_1, \ldots, S_N$ be strategies and let $\mathcal{T}$ be a binary tree with $S_j$, $j = 1, \ldots, N$ at the leaves.
Denote the number of left transitions on the way from the root to $S_j$ by $d_j^l$ and the number of right transitions
by $d_j^r$. Then for any $\epsilon > 0$ there exists an algorithm that satisfies

$$\mathrm{gain} \ge \mathrm{gain}(S_j) - d_j^r\,\epsilon - d_j^l\, e^{-\epsilon^2 T+1}/\sqrt{T},$$

for all $j = 1, \ldots, N$. This can be achieved by a convex combination of strategies $S_1, \ldots, S_N$. Note that if $S_1$
is the leftmost child, then the regret with respect to $S_1$ is exponentially small.

Remark 5 In particular, it follows that for any $\gamma > 0$ there exists an algorithm for combining $N$ strategies
that has regret $O(\sqrt{T(\log N + \log T)})$ against the best of $N$ strategies and loss at most $O((NT)^{-\gamma})$ with
respect to any strategy fixed a priori. These bounds are optimal and improve on the work of [12].

An interesting application of Theorem 4 is as follows. Consider a set of strategies $S_j$, where each $S_j$
is a boolean function mapping $|S_j|$ bits to one bit. The number of bits needed to specify a function $S_j$ on
$k$ bits is $\mathrm{comp}(S_j) = 2^k$, and we call this the complexity of $S_j$. Our techniques yield an algorithm that is
simultaneously competitive against all strategies of complexity up to any fixed $K$ and satisfies

$$\mathrm{gain} \ge \max_j \left\{ \mathrm{gain}(S_j) - O\left(\sqrt{T\,\mathrm{comp}(S_j) \log T}\right),\ 0 \right\} - O\left(T^{-2}\right). \qquad (2)$$

A detailed derivation is given at the end of section 2.
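As an illustration of what a strategy of complexity $2^k$ looks like, the sketch below represents a boolean function on $k$ bits by its truth table and evaluates its payoff on a ±1 sequence. The encoding (most recent $k$ bits read as a binary index, no bet during the first $k$ steps) and the name `table_strategy_payoff` are assumptions made for this example only.

```python
from typing import Sequence


def table_strategy_payoff(table: Sequence[int], k: int,
                          bits: Sequence[int]) -> int:
    """Payoff of a strategy that predicts b_t from the previous k bits.

    `table` has 2**k entries in {-1, +1}; its length is the strategy's
    complexity in the sense used above. No bet is placed for t < k.
    """
    assert len(table) == 2 ** k
    total = 0
    for t in range(k, len(bits)):
        # Read the previous k bits as a binary number (+1 -> 1, -1 -> 0).
        idx = 0
        for b in bits[t - k:t]:
            idx = 2 * idx + (1 if b > 0 else 0)
        total += table[idx] * bits[t]
    return total


bits = [+1, -1] * 4  # alternating sequence of length 8
flip = [+1, -1]      # k = 1: after a -1 predict +1, after a +1 predict -1
print(table_strategy_payoff(flip, 1, bits))  # 7
```

Here a table of only two entries predicts every bit after the first correctly, so its payoff is 7 out of a possible 8.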

Remark 6 Let $S$ be the input string and let $S_j$ be the smallest Turing machine that produces $S$. Then the
payoff of $S_j$ on $S$ is $T$, and by the guarantees above the algorithm gets payoff at least $T - O(\mathrm{comp}(S_j))$ when
$\mathrm{comp}(S_j) = \Theta(T/\log T)$. One cannot get regret better than $\mathrm{comp}(S_j)/\log T$, so the algorithm is optimal up
to a $\log^2 T$ factor in the regret in this setting.

Results given above do not guarantee good regret against combinations of strategies, i.e. a compound 
strategy that uses different Sj's at different intervals. Such guarantees are given by 

Theorem 7 Consider the result of running the algorithm above on a sequence $b_t$, using a set of strategies
$\{S_1, \ldots, S_N\}$. Let $I_1, \ldots, I_k$, $I_j = [A_j, B_j] \subseteq [1..T]$ be a covering of $[1..T]$ by disjoint intervals. Then for
any assignment of strategies to intervals $\eta_j$, $1 \le j \le k$ one has

k 

gain > ^ 

i=i 

where $S_{\eta_j}(I)$ is the cumulative payoff of strategy $S_{\eta_j}$ on interval $I$.

Remark 8 Note that the algorithm adapts to any structure of intervals $I_j$ and no knowledge of their number
or size is required. Moreover, it follows by an application of Theorem 3 to each interval $I_j$ separately that the
guarantees in Theorem 7 are optimal.
1.5 Uniformity of the payoff sequence

We first introduce the notion of smoothed payoff, which is an important tool in our analysis. Our algorithm
will use a weighted sum of previous values $b_j$, $j < t$ to predict the sign of $b_t$. The decay parameter, which we
denote by $p$, can be interpreted as defining a window of size $n = 1/(1 - p)$ since most of the weight mass
is contained in the interval $[t - O(n), t]$. We will refer to $n$ as the window length in what follows. We first
introduce a definition.

Definition 9 For a strategy S we denote the payoff of S at time t by st and the smoothed payoff of S with 
window length n at time t by 

Note that the meaning of the smoothed payoff depends on the window length $n = 1/(1 - p)$, which will sometimes be omitted
if it is clear from context. Also note that we always have $|s_t| \le 1$.

We now introduce the notion of Z-uniformity, which corresponds to the sequence of smoothed payoffs 
becoming 'noise-like' after the application of our algorithm. We start with 

Definition 10 A sequence $s_j$ is $Z$-uniform at scale $p = 1 - 1/n$ if its smoothed payoff at scale $p$ is at most $c\sqrt{n \log(1/Z)}$ for all
$1 \le t \le T$, for some constant $c > 0$.

Remark 11 Note that if the input sequence is iid Ber(±1, 1/2), then it is $Z$-uniform at any scale with probability
at least $1 - Z$ for any $Z > 0$.
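The statement about random input can be probed empirically with a short sketch. It assumes the smoothed payoff is the unnormalized discounted sum $\sum_{j \le t} p^{t-j} s_j$ (as in the $Z$-uniformity condition of the introduction), and the constant $c = 2$ and all parameter values are arbitrary illustrative choices rather than values taken from the paper.

```python
import math
import random
from typing import List, Sequence


def discounted_sums(seq: Sequence[float], n: float) -> List[float]:
    """Running sums sum_{j<=t} p^(t-j) * seq[j] with p = 1 - 1/n."""
    p = 1.0 - 1.0 / n
    out, acc = [], 0.0
    for r in seq:
        acc = p * acc + r  # discount the past, add the new term
        out.append(acc)
    return out


random.seed(0)
T, n, Z = 5000, 100, 1e-3
coin = [random.choice((-1, 1)) for _ in range(T)]  # iid Ber(+-1, 1/2)

c = 2.0  # illustrative constant in c * sqrt(n * log(1/Z))
threshold = c * math.sqrt(n * math.log(1.0 / Z))
violations = sum(abs(x) > threshold for x in discounted_sums(coin, n))
print(f"threshold={threshold:.1f}, steps above threshold: {violations}/{T}")
```

For comparison, a constant sequence of +1's drives the discounted sum towards $n$, far above the threshold whenever $n \gg \log(1/Z)$, so it is not noise-like at that scale.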

Let $S_1, \ldots, S_N$ be a set of strategies. Denote the payoff of $S_i$ at time $t$ by $s_{i,t}$. Let $\mathcal{T}(S)$ be the linear
comparison tree of $S$: the tree contains $N \log T$ levels corresponding to applications of strategies $S_i$ with
geometrically decreasing window size. We index levels of $\mathcal{T}(S)$ by pairs $(i, j)$: strategy $S_i$ is applied with
window length $2^j$ at level $(i, j)$. Denote by $s_{i,j,t}$ the payoff of the strategy obtained at level $(i, j)$ at time $t$.

Let $s_{*,t} = s_{N, \log T, t}$. Then

Theorem 12 The sequences $s_{i,t} - s_{*,t}$ are $Z$-uniform for any $1 \le i \le N$ at any scale $p \ge 1 -
1/(40 \log(1/Z))$ when $Z = O((NT)^{-2})$.

Thus, after the application of our algorithm the sequence $s_{i,t} - s_{*,t}$ does not exceed the standard deviation
of a uniformly random variable in any sufficiently large window. Note that this is a significant strengthening
of adaptive regret guarantees of [18]. A very surprising consequence of this result is

Corollary 13 There exists an algorithm that outputs a prediction $\hat{b}_t$ of an adversarial real-valued signal
$b_t \in [-1, 1]$ such that

$$\sum_{j=1}^{t} (1 - 1/n)^{t-j}\, |b_j - \hat{b}_j| \le \min_{z} \sum_{j=1}^{t} (1 - 1/n)^{t-j}\, |b_j - z| + O\left(\sqrt{n \log T}\right).$$

Thus, the error in prediction is what one would expect to see with probability at least T - ®^ 1 ) if bt were 
uniformly random with mean z op t(t). 
Figure 1: (Left) How should one predict the next bit? How should one weigh the recent bits vs the older bits?
(Right) How confident should one be in the prediction based on the deviation in the number of 0's and 1's in the
last $n$ bits? Which of these confidence functions is best?

2 Main algorithm 

In this section we state our main algorithm in the simplest form and give the corresponding analysis. A more
general form of the algorithm, which allows one to obtain regret bounds that depend on an $\ell_p$-norm of the sequence
$b_t$, is given in section 3.

We start by giving intuition about the algorithm by considering some natural approaches to bit prediction.
What would one do to predict the next bit given the previous $n$ bits (Figure 1, left panel)? If all $n$ bits are 1,
should we predict 1? With what confidence? What if they aren't all 1 but there are many more 1's than 0's?
How should the prediction confidence depend on the imbalance $x$ (= number of 1's $-$ number of 0's)? What
if there are more 0's but the last few bits are 1; should we give more weight to the recent bits? Figure 1,
right panel, shows some possible confidence functions for predicting 1 based on $x$ (for example, the weighted
majority uses the $\tanh(x/\sqrt{T})$ function). We will devise a function that allows one to bound the loss. We
will weight the recent bits higher: the $i$-th last bit will have weight $(1 - 1/n)^i$. Thus, in order to predict $b_{i+1}$
from the previous bits, we compute the discrepancy $x = \sum_{j \ge 0} b_{i-j}(1 - 1/n)^j$. Our confidence function $g(x)$
will be essentially zero until $x > \sqrt{n}$, after which it shoots to 100% very fast. We can show that with this
confidence function we will never incur a significant loss.

Our algorithm will have the following form. For a chosen discount factor $p = 1 - 1/n$, $0 \le p \le 1$, the
algorithm maintains a discounted deviation $x_t = \sum_{j=1}^{t} p^{t-j} b_j$ at each time $t = 1, \ldots, T$. The value of the
bet at time $t + 1$ is then given by $g(x_t)$ for a function $g(\cdot)$ to be defined. The function $g$ as well as the discount
factor $p$ depend on the desired bound on expected loss and regret against $S_+$ and $S_-$.
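The discounted deviation never has to be recomputed from scratch: it satisfies the one-step recurrence $x_t = p\,x_{t-1} + b_t$. The sketch below checks this on a small example; the function names and the sample sequence are illustrative.

```python
from typing import Sequence


def deviation_direct(bits: Sequence[float], p: float) -> float:
    """x_t = sum_{j=1}^{t} p^(t-j) * b_j, evaluated term by term."""
    t = len(bits)
    return sum(p ** (t - j) * bits[j - 1] for j in range(1, t + 1))


def deviation_recursive(bits: Sequence[float], p: float) -> float:
    """Same quantity via x <- p * x + b, using O(1) memory."""
    x = 0.0
    for b in bits:
        x = p * x + b
    return x


bits = [+1, -1, +1, +1, -1]
p = 1.0 - 1.0 / 10  # window length n = 10
print(deviation_direct(bits, p), deviation_recursive(bits, p))
```

The recursive form is the one an online algorithm would use, since it needs only the previous value of $x$ and the newly revealed bit.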
We now describe the algorithm.
Algorithm 1: Bounded loss prediction

    $x_1 \leftarrow 0$
    for $t = 1$ to $T$ do
        Bet on $\mathrm{sign}(g(x_t))$ with confidence $|g(x_t)|$.
        Set $x_{t+1} \leftarrow p x_t + b_t$.
    end for
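A direct transcription of Algorithm 1 into runnable form might look as follows. The confidence function is passed in as a parameter because $g$ is only specified later; `toy_g` is a hypothetical clipped-linear stand-in used purely to exercise the loop and is not the paper's confidence function.

```python
from typing import Callable, Sequence


def run_bounded_loss_prediction(bits: Sequence[float], n: float,
                                g: Callable[[float], float]) -> float:
    """Run the betting loop of Algorithm 1 and return the total payoff."""
    p = 1.0 - 1.0 / n
    x = 0.0              # x_1 <- 0
    total = 0.0
    for b in bits:
        f = g(x)         # signed bet: sign(g(x_t)) with confidence |g(x_t)|
        total += f * b   # payoff of this round, once b_t is revealed
        x = p * x + b    # x_{t+1} <- p * x_t + b_t
    return total


def toy_g(x: float) -> float:
    """Hypothetical stand-in confidence function (not the paper's g)."""
    return max(-1.0, min(1.0, x / 4.0))


print(run_bounded_loss_prediction([1, 1, 1, 1], 2, toy_g))  # 1.0625
```

With $n = 2$ the deviation visits 0, 1, 1.5 and 1.75, so the bets are 0, 0.25, 0.375 and 0.4375, which sum to the printed payoff on an all-ones input.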

For $t = 1, \ldots, T$ define $\Phi_t = \int_0^{x_t} g(s)\,ds$. We will use $\Phi_t$ as a potential function, and the payoff of the
algorithm will be closely related to the change in $\Phi_t$. The function $g(x)$ will be chosen to be a continuous
odd function that is almost zero for $|x| < L$, and equal to 1 for $x > U$ and to $-1$ when $x < -U$, for some
values $0 < L < U < T$. Thus, we will have that $|x_t| - U \le \Phi_t \le |x_t|$. Intuitively, $\Phi_t$ captures the imbalance
between the number of $-1$'s and $+1$'s in the sequence up to time $t$.

We now give an informal sketch of the proof, which will be made precise in Lemma 14 and Lemma 15
below. We will choose the confidence function $g(x)$ so that

$$\Phi_t = \int_0^{x_t} g(s)\,ds$$

is a potential function, which serves as a repository for guarding our loss. The condition that $\Phi_t$ is a potential
function is that the change of $\Phi_t$ must lower bound the gain. If we let $\Phi_t = G(x_t)$, where

$$G(x) = \int_0^{x} g(s)\,ds,$$

we have

$$\Phi_{t+1} - \Phi_t = G(x_{t+1}) - G(x_t) \approx G'(x)\Delta x + G''(x)\Delta x^2/2 \approx g(x)\left[(p - 1)x + b_t\right] + g'(x)/2.$$

Since the gain of the algorithm at time step $t$ is $g(x_t) b_t$, we have

$$\Delta\Phi_t - g(x_t) b_t = -g(x_t)(1 - p)x_t + g'(x_t)/2,$$

so the condition becomes

$$-g(x_t)(1 - p)x_t + g'(x_t)/2 \le Z,$$

where $Z$ is the desired bound on per step loss of the algorithm.
Solving this equation yields a function of the form

$$g(x) = Z e^{x^2/(4T)}.$$

In what follows we make this proof sketch precise. We will write $g'(x)$ for functions $g$ whose derivative
may have jump discontinuities. However, this will not create any confusion since $g'(x)$ appears in the form
$\max_{x \in [A, B]} g'(x)$, where we take the limit from the right at $A$ and the limit from the left at $B$. We always
have $p = 1 - 1/n$ and use the notation $\bar{p} = 1 - p$. The statement of Lemma 14 involves a function $h(x)$
that will be chosen later. For ease of understanding one may think of $h(x) = 1/2$ throughout the proof of the
lemma, even though we will use a different $h(x)$ later.

Lemma 14 Suppose that the function $g(x)$ used in Algorithm 1 satisfies

$$\max_{s' \in [px - 1,\, px + 1]} |g'(s')|\,(s' - x)^2/2 \le \bar{p}\, x g(x) h(x) + Z'$$

for a function $h(x)$, $0 \le h(x) \le 1$, $\forall x$, for some $Z' > 0$. Then the payoff of the algorithm is at least

$$\sum_{t=1}^{T} \bar{p}\, x_t g(x_t)(1 - h(x_t)) + \Phi_T - Z'T$$

as long as $|b_t| \le 1$ for all $t$.

Proof: We will show that at each $t$

$$\Phi_{t+1} - \Phi_t \le b_t g(x_t) + Z' - \bar{p}\, x_t g(x_t)(1 - h(x_t)),$$

i.e.

$$\sum_{t=1}^{T} b_t g(x_t) \ge -Z'T + \sum_{t=1}^{T} \bar{p}\, x_t g(x_t)(1 - h(x_t)) + \Phi_T - \Phi_1,$$

thus implying the claim of the lemma since $\Phi_1 = 0$.

We consider the case $x_t > 0$. The case $x_t < 0$ is analogous. In the following derivation we will write
$[A, B]$ to denote $[\min\{A, B\}, \max\{A, B\}]$.

$0 \le b_t \le 1$: We have $x_{t+1} = p x_t + b_t = x_t - \bar{p} x_t + b_t$, and the expected gain of the algorithm is $g(x_t) b_t$. Then

$$\begin{aligned}
\Phi_{t+1} - \Phi_t = \int_{x_t}^{x_t - \bar{p} x_t + b_t} g(s)\,ds
&\le g(x_t)(-\bar{p} x_t + b_t) + \max_{s' \in [x_t,\, x_t - \bar{p} x_t + b_t]} |g'(s')|\,(s' - x_t)^2/2 \\
&= g(x_t) b_t + \left[ -g(x_t)\bar{p} x_t + \max_{s' \in [x_t,\, x_t - \bar{p} x_t + b_t]} |g'(s')|\,(s' - x_t)^2/2 \right] \\
&\le g(x_t) b_t + (-1 + h(x_t))\,\bar{p}\, x_t g(x_t) + Z'.
\end{aligned}$$

$-1 \le b_t \le 0$: This case is analogous.



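We now define $g(x)$ to satisfy the requirement of Lemma 14. For any $Z, L > 0$ and $U = 2L\sqrt{\log(1/Z)}$ let

$$g(x) = \begin{cases} \mathrm{sign}(x)\, Z e^{x^2/(4L^2)}, & |x| \le U \\ 1, & x > U \\ -1, & x < -U \end{cases} \qquad (3)$$

and let $p = 1 - 1/n$. We choose

$$h(x) = \begin{cases} 1, & |x| < U \\ 0, & \text{o.w.} \end{cases} \qquad (4)$$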

The following lemma shows that the function $g(x)$ satisfies all required properties:

Lemma 15 Let $L > 0$ be such that $\bar{p} = 1/n \ge 1/L^2$. Then for $n \ge 40 \log(1/Z)$ the function $g(x)$ used in
Algorithm 1 satisfies

$$\max_{s \in [px - 1,\, px + 1]} |g'(s)|\,(s - x)^2/2 \le \bar{p}\, x g(x) h(x) + Z,$$

where $h(x)$ is the step function defined above.
Proof: The intuition behind the Lemma is very simple. Note that $s \in [\rho x - 1, \rho x + 1]$ is not much further than 1 away from $x$, so $g'(s)$ is very close to $g'(x) = \left(\frac{x}{2L^2}\right) g(x)$. Since $\bar\rho \ge 1/L^2$, we have $g'(x) = \left(\frac{x}{2L^2}\right) g(x) \le \bar\rho x g(x)/2$. In fact the function $g(x)$ was chosen to satisfy this differential equation. In what follows we make this intuition precise.

First note that $g'(s) = 0$ for $|s| > U$, and $g'(x) \le Z$ when $x \le L$, so we only need to consider $L - 1 \le x \le U/\rho + 1$.

We have $|g'(s)| \le (s/2L^2) g(s)$ for $|s| \le U/\rho + 1$. Thus,

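$$\begin{aligned}
\max_{s \in [\rho x - 1,\, \rho x + 1]} |g'(s)|
&\le \left( \max_{s \in [\rho x - 1,\, \rho x + 1]} (s/x) \right) \cdot \left( \max_{s \in [\rho x - 1,\, \rho x + 1]} g(s)/g(x) \right) \cdot |g'(x)| \\
&\le (1 + 1/(L-1)) \max_{s \in [\rho x - 1,\, \rho x + 1]} e^{(s^2 - x^2)/4L^2} \left(\frac{x}{2L^2}\right) g(x) \qquad (5) \\
&\le (1 + 1/(L-1)) \left(1 + (2x(\bar\rho + 1) + (\bar\rho + 1)^2)/2L^2\right) \left(\frac{x}{2L^2}\right) g(x)
\end{aligned}$$

since $x \le U/\rho + 1$.

Finally, when $x \le U/\rho + 1$,

$$\max_{s' \in [\rho x - 1,\, \rho x + 1]} (s' - x)^2 \le (\bar\rho x + 1)^2 \le (\bar\rho U/\rho + 1)^2. \tag{6}$$

Putting (5) and (6) together, we get

$$\max_{s' \in [\rho x - 1,\, \rho x + 1]} |g'(s')| (s' - x)^2/2 \le \bar\rho x g(x)$$

for $L - 1 \le x \le U/\rho + 1$. Also, we have

$$\max_{s' \in [\rho x - 1,\, \rho x + 1]} |g'(s')| (s' - x)^2/2 \le Z$$

for $x \le L$, which gives the result. ■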

Remark 16 It can be verified that if one has

$$g(x) = \begin{cases} e^{1/4} Z x/L, & |x| < L \\ \mathrm{sign}(x)\, e^{\log Z + (x/2L)^2}, & L \le |x| < U \\ 1, & x > U \\ -1, & x < -U \end{cases} \tag{7}$$

the function $g(x)$ satisfies

$$\max_{s \in [\rho x - 1,\, \rho x + 1]} |g'(s)| (s - x)^2/2 \le \bar\rho x g(x) h(x) + e\bar\rho L Z.$$

We will use this function $g(x)$ in what follows.

We prove the regret bound of our algorithm in terms of the notion of smoothed payoff. This also allows us to conclude that the algorithm gets positive payoff whenever the imbalance is high in at least one contiguous window of size $n$. We will use the notation

$$|x|^+_\epsilon = \begin{cases} 0, & |x| < \epsilon \\ |x|, & |x| \ge \epsilon \end{cases}$$
Theorem 17 Let $n$ be the window size parameter of Algorithm 1. Then its payoff is at least



T-l 

E\x t \ + , + \x T \ + , lp - eZT/Jn. 
1 * l 2Vlog(l/^)/n ' l 2^log(l/Z)/n /,J /V 



t=l 



Proof: By Lemma 15 and Remark 16 we have that the function $g(x)$ satisfies the conditions of Lemma 14, and so we get that the payoff of the algorithm is at least

$$\sum_{t=1}^{T-1} \bar\rho |x_t|^+_U + \Phi_T \ge \sum_{t=1}^{T-1} \bar\rho x_t + x_T - \bar\rho T U - U.$$

Setting $U = 2\sqrt{T \log(1/Z)}$ and $\rho = 1 - 1/T$, we get the desired statement. ■

Note that the payoff of the algorithm is positive whenever the imbalance is higher than, say, $4\sqrt{n \log T}$ in at least one window of size $n$. Similarly, we can now give

Proof of Theorem 1: In light of Theorem 17 it remains to bound $\sum_{t=1}^{T-1} \bar\rho x_t + x_T$. We have

$$\bar\rho \sum_{t=1}^{T-1} x_t + x_T = \bar\rho \sum_{t=1}^{T-1} \sum_{j=1}^{t} \rho^{t-j} b_j + x_T = \sum_{t=1}^{T-1} b_t (1 - \rho^{T-t}) + x_T = \sum_{t=1}^{T} b_t. \tag{8}$$

Thus, we get using Theorem 17 that the payoff of the algorithm is at least

$$\sum_{t=1}^{T-1} \bar\rho x_t + x_T - 4\sqrt{T \log(1/Z)} - e\bar\rho L Z T \ge \sum_{t=1}^{T} b_t - 4\sqrt{T \log(1/Z)} - eZ\sqrt{T}.$$

■


Remark 18 It is interesting to note that, as follows from Theorem 17, Algorithm 1 gets positive payoff if the discounted deviation $x_t$ exceeds $O\left(\sqrt{T \log(1/Z)}\right) + O(Z\sqrt{T})$ for at least one value of $t$ between 1 and $T$.



Proof of Theorem 3: Let $A$ be an algorithm with regret at most $\sqrt{T \ln(1/Z)}$ with respect to $S_+$. Consider a sequence $X_t = \mathrm{Ber}(\pm 1, 1/2)$ of independent random variables. The payoff of $S_+$ is equal to $\sum_{t=1}^{T} X_t$. Since for some constant $c > 0$

$$\Pr\left[ \sum_{t=1}^{T} X_t \ge 2\sqrt{T \ln(1/Z)} \right] \ge Z^c$$

we have that $A$ gets payoff at least $\sqrt{T \ln(1/Z)}$ with probability at least $Z^c$. Since the expected payoff of any algorithm on this sequence is equal to 0, $A$ incurs loss at least $Z^c \sqrt{T \log(1/Z)}/(1 - Z^c)$ on at least one sequence. This gives the statement of the theorem after choosing $Z' = Z^c$ for a suitable constant $c > 0$. ■
We now give the details of the bounds stated in (0. Consider a set of strategies Sj, where each Sj is a 
boolean function mapping $|S_j|$ bits to one bit. The number of bits needed to specify a function $S_j$ on $k$ bits is $\mathrm{comp}(S_j) = 2^k$, and we call this the complexity of $S_j$. Then by choosing an unbalanced tree with $S_j$ at its
leaves so that <£ = j and (E = 1 in Theorem 01 and use Zj = l/j 2 at the j-th comparison node. Thus, we 
get an algorithm that is simultaneously competitive against all strategies of complexity up to any fixed K and 
satisfies 

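$$\mathrm{gain} \ge \max_j \left\{ \mathrm{gain}(S_j) - O\left(\sqrt{T\, \mathrm{comp}(S_j)}\right) \right\} - O\left(\sqrt{T}\right),$$

where we used the fact that $\sum_{j=1}^{\infty} 1/j^2 = O(1)$.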
3 Discounted payoff

In this section we state our algorithm in the most general form and derive bounds on its discounted payoff. We now describe the algorithm:

Algorithm 2: Bounded loss prediction

$x_1 \leftarrow 0$
for t = 1 to T do
    Bet on $\mathrm{sign}(g(x_t))$ with confidence $|g(x_t)|$
    Set $x_{t+1} \leftarrow$ UPDATE$(x_t, b_t)$.
end for

Here we use the function UPDATE$(x_t, b_t)$, which returns $\rho_t x_t + b_t$, i.e. uses discounting factors that in general depend on the time step. Note that the exact form of $\rho_t$ in Algorithm 2 is not specified. Different settings of $\rho_t$ discussed below yield different guarantees.

The analysis relies on the following two lemmas, which are analogous to Lemma 14 and Lemma 15 in Section 2. The proofs are given in Appendix A.

Lemma 14a Suppose that the function $g(x)$ used in Algorithm 2 satisfies

$$\max_{s' \in [\rho x - |b_t|,\, \rho x + |b_t|]} |g'(s')| (s' - x)^2/2 \le \bar\rho_t x g(x) h(x) + Z'$$

for a function $h(x)$, $0 \le h(x) \le 1$, $\forall x$, $Z' > 0$, and all $0 \le \Delta \le 1$. Also, suppose that

$$\bar\eta \left| \int_0^z g(s)\,ds \right| \le \bar\rho z |g(z)|/2$$

for any $\eta \ge \eta^*$. Then

1. the payoff of the algorithm satisfies

$$\sum_{j=1}^{t} g(x_j) b_j \ge \sum_{j=1}^{t} \bar\rho_j x_j g(x_j)(1 - h(x_j)) + \Phi_t - Z't$$

2. if $\rho_t = \rho$, then at each time step $1 \le t \le T$ the $\eta$-smoothed payoff of the algorithm satisfies

$$\sum_{j=1}^{t} \eta^{t-j} g(x_j) b_j \ge \Phi_t - Z'/(1 - \rho)$$

as long as $|b_t| \le 1$ for all $t$.

The following lemma shows that the function $g(x)$ defined in (3) satisfies all required properties:

Lemma 15a Let $L > 0$ be such that $\bar\rho \ge \Delta^2/n$, $1/n \ge 1/L^2$. Then for $n \ge 40\log(1/Z)$ the function $g(x)$ defined in (3) satisfies

$$\max_{s \in [\rho x - \Delta,\, \rho x + \Delta]} |g'(s)| (s - x)^2 \le \bar\rho x g(x) h(x) + Z,$$

where $h(x)$ is the step function defined above.

Remark 19 It can be verified that the function $g(x)$ defined in (7) satisfies

$$\max_{s \in [\rho x - 1,\, \rho x + 1]} |g'(s)| (s - x)^2 \le \bar\rho x g(x) h(x) + e\bar\rho L Z.$$
Note that when the discounting parameter is allowed to depend on time, the discounted deviation takes the form

$$x_t = \sum_{j=1}^{t} b_j \prod_{i=j}^{t-1} \rho_i.$$

The following crucial property is a more general version of (8):

Lemma 20

$$\sum_{j=1}^{T-1} (1 - \rho_j) x_j + x_T = \sum_{j=1}^{T} b_j.$$

Proof: Induction on $T$.

Base case: $T = 1$. The statement is true since $x_1 = b_1$.

Inductive step:

$$\begin{aligned}
\sum_{j=1}^{T-1} (1 - \rho_j) x_j + x_T &= \sum_{j=1}^{T-2} (1 - \rho_j) x_j + (1 - \rho_{T-1}) x_{T-1} + \rho_{T-1} x_{T-1} + b_T \\
&= \sum_{j=1}^{T-2} (1 - \rho_j) x_j + x_{T-1} + b_T = \sum_{j=1}^{T-1} b_j + b_T,
\end{aligned}$$

where we used the inductive hypothesis in the last step. ■
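The identity of Lemma 20 is easy to check numerically. The sketch below draws arbitrary payoffs and arbitrary time-dependent discount factors (both hypothetical test data), builds the discounted deviations by the recursion $x_1 = b_1$, $x_t = \rho_{t-1} x_{t-1} + b_t$, and compares the two sides.

```python
import random


def discounted_deviations(b, rho):
    """Return x_1..x_T with x_1 = b_1 and x_t = rho_{t-1} * x_{t-1} + b_t."""
    xs = []
    x = 0.0
    for t, b_t in enumerate(b):
        x = b_t if t == 0 else rho[t - 1] * x + b_t
        xs.append(x)
    return xs


random.seed(0)
T = 50
b = [random.uniform(-1, 1) for _ in range(T)]
rho = [random.uniform(0.5, 1.0) for _ in range(T)]   # rho_1..rho_T, time-dependent

x = discounted_deviations(b, rho)
# Left-hand side: sum_{j=1}^{T-1} (1 - rho_j) x_j + x_T (lists are 0-indexed).
lhs = sum((1 - rho[j]) * x[j] for j in range(T - 1)) + x[-1]
rhs = sum(b)
```

The two sides agree up to floating-point error for any choice of discount factors, which is what allows the regret analysis to go through with time-dependent $\rho_t$.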



Remark 21 The range of allowed discounting factors $[\eta^*, 1]$ depends on the function $g$ and the function UPDATE used in the algorithm. For Algorithm 2 any $\eta^* \ge 1 - (1 - \rho)/2$ is valid trivially. However, for the update function given in Section 4 any $\eta^* \ge \rho$ will also be valid, yielding better guarantees.

Regret bounds obtained in the previous sections depend on the best upper bound on $|b_t|$ that is available: in fact, we assumed that the input is scaled so that $|b_t| \le 1$. Thus, the bounds on the regret scale linearly with $\|b\|_\infty$. This is tight up to constant factors, as shown in Theorem 3. It is natural to ask if better bounds can be obtained if the sequence $b_t$ has small $\ell_p$ norm for some $p > 0$. We will use the notation $\mu_p(b) := \sum_{k=1}^{T} |b_k|^p$. By choosing $\rho_t = 1 - |b_t|^p/n$, we get

Theorem 22 Let $b_t$ be the sequence of payoffs such that $|b_t| \le M$ and $\mu_p(b) \ge 40\log(1/Z)$ (this can be achieved by rescaling the values of $b_t$ if necessary, thus increasing the value of $M$). Then one can obtain regret at most $M\sqrt{\mu_p(b) \log(1/Z)}$ and loss at most $MZ\sqrt{\mu_p(b)}$ for any $p \le 2$.

Proof: Fix $0 < p \le 2$. Note that all regret and loss bounds obtained so far scale linearly with $M$. The loss property follows immediately from Lemma 14a by setting $\rho(b) = 1 - |b|^p/n$, $n = \mu_p$. Using Lemma 14a and Lemma 20 we get that the regret is at most

$$O\left( M\sqrt{n \log(1/Z)} + M\sqrt{n \log(1/Z)} \sum_t \bar\rho_t \right) = O\left( M \mu_p(b) \sqrt{n^{-1} \log(1/Z)} + M\sqrt{n \log(1/Z)} \right).$$

Setting $n = \mu_p$, we get regret $O\left( M\sqrt{\mu_p \log(1/Z)} \right)$. ■
4 Combining strategies

In this section we show how Algorithm 1 can be used to combine any two strategies $S_1$ and $S_2$ and generalize this to an arbitrary number of strategies, thus proving Theorem 2.

We start with the case of two strategies $S_1, S_2$. Our algorithm will consider $S_1$ as the base strategy (corresponding to the null strategy $S_0$ in the previous section) and will use $S_2$ to improve on $S_1$ whenever possible, without introducing significant loss over $S_1$ in the process. Note that from this point on we deal with general strategies that are represented by payoff vectors, as opposed to bit sequences. Algorithm 1, which was developed for predicting bit sequences, will be used in the reduction.

Let $s_{1,t}, s_{2,t} \in [-1, 1]$ be the payoffs of $S_1$ and $S_2$ respectively at time $t$. Define

$$\hat g(x) = \begin{cases} g(x/2), & x > 0 \\ 0, & \text{o.w.} \end{cases}$$

It is easy to see that $\hat g(x)$ satisfies the conditions of Lemma 14 with $h(x)$ as defined in (4). The intuition behind the algorithm is that since the difference in payoff obtained by using $S_2$ instead of $S_1$ is given by $(s_2 - s_1)$, it is sufficient to emulate Algorithm 1 on this sequence. In particular, we set $x_t = \overline{(s_2 - s_1)}_t/(1 - \rho)$ and bet $\hat g(x_t)$ (note that since $|s_1 - s_2| \le 2$, using $g(x/2)$ in the definition of $\hat g$ is sufficient). Betting 0 corresponds to using $S_1$, betting 1 corresponds to using $S_2$, and fractional values correspond to a convex combination of the two (which can be interpreted as a distribution over $S_1, S_2$).

Formally, the algorithm takes the following form:

Algorithm 3: Combining two strategies

$x_1 \leftarrow 0$
for t = 1 to T do
    Set $s^*_t \leftarrow s_{1,t}(1 - \hat g(x_t)) + s_{2,t}\, \hat g(x_t)$.
    Set $x_{t+1} \leftarrow \rho x_t + (s_{2,t} - s_{1,t})$.
end for
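The combination step can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: the betting function `g` uses the piecewise form suggested by Remark 16, the one-sided bet `g_hat` applies it to $x/2$ for positive $x$ only, and $L = \sqrt{T}$, $\rho = 1 - 1/T$. All function names are hypothetical.

```python
import math


def g(x, Z, L):
    """ASSUMED betting function: linear ramp, exponential growth, cap at +/-1."""
    U = 2.0 * L * math.sqrt(math.log(1.0 / Z))
    a = abs(x)
    if a >= U:
        m = 1.0
    elif a >= L:
        m = Z * math.exp((a / (2.0 * L)) ** 2)
    else:
        m = math.exp(0.25) * Z * a / L
    return math.copysign(m, x)


def g_hat(x, Z, L):
    """One-sided bet: weight placed on the second strategy."""
    return g(x / 2.0, Z, L) if x > 0 else 0.0


def combine(s1, s2, Z):
    """Per-step payoffs of the convex combination of a base strategy s1 and s2."""
    T = len(s1)
    rho = 1.0 - 1.0 / T
    L = math.sqrt(T)
    x = 0.0                                   # discounted sum of s2 - s1
    combined = []
    for a, b in zip(s1, s2):
        w = g_hat(x, Z, L)                    # weight on s2, fixed before seeing payoffs
        combined.append(a * (1.0 - w) + b * w)
        x = rho * x + (b - a)
    return combined


T, Z = 1000, 0.01
safe = combine([1.0] * T, [-1.0] * T, Z)      # second strategy is always worse
gain = combine([-1.0] * T, [1.0] * T, Z)      # second strategy is always better
```

When the second strategy never helps, the weight stays at zero and the base payoff is reproduced exactly; when it always helps, the combination switches over after a warm-up period whose length is governed by $U$.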



Denote the coefficients of $S_i$ in $S^*$ at time $t$ by $p_{i,t}$. We have
Lemma 23 For all $1 \le t \le T$

j=i j=i j=i 



In particular, after setting p = 1 — 1/T and U = \\/T log(l/Z) as before, we get 

$$G \ge S_1 + \max\left\{ S_2 - S_1 - O\left(\sqrt{T \log(1/Z)}\right),\, 0 \right\} - O(Z\sqrt{T})$$

Proof: Algorithm 3 amounts to applying Algorithm 1 to the sequence $(s_2 - s_1)$, and hence by Theorem 17 the payoff of Algorithm 3 is at least

$$\sum_{j=1}^{t} \left( s_{1,j} + (s_{2,j} - s_{1,j})\, \hat g(x_j) \right) = \sum_{j=1}^{t} s_{1,j} + \sum_{j=1}^{t} (s_{2,j} - s_{1,j})\, \hat g(x_j)$$

3=1 3=1 

This immediately yields (2), and we get (1) by setting parameters as stated. ■ 
It is important to emphasize the property that Algorithm 3 combines two strategies $S_1$ and $S_2$, improving on the performance of $S_1$ using $S_2$ whenever possible, essentially without introducing any loss with respect to $S_1$.

Algorithm 3 can be used recursively to combine $N$ strategies $S_1, \ldots, S_N$. Consider a binary tree $\mathcal{T}$ with $S_j$ at its leaves. Each interior node $v \in \mathcal{T}$ can run Algorithm 3 using the left child as $S_1$ and the right child as $S_2$, as specified in Algorithm 5.

Algorithm 4: COMBINE(v)

if v is a leaf then
    return $S_v$
else
    $S_l \leftarrow$ COMBINE(left(v))
    $S_r \leftarrow$ COMBINE(right(v))
    $S_v \leftarrow S_l (1 - \hat g(x_t(v))) + S_r\, \hat g(x_t(v))$
    $x_{t+1}(v) \leftarrow \rho x_t(v) + (s_{r,t} - s_{l,t})$
    return $S_v$
end if

Algorithm 5: Combining multiple strategies

For each internal node $v \in \mathcal{T}$ let $x_1(v) \leftarrow 0$
for t = 1 to T do
    COMBINE(root)
end for

Proof of Theorem 2: Run Algorithm 5 on $\mathcal{T}$. The guarantees follow using Lemma 23. Note that the regret with respect to $S_j$ is given by the number of right transitions from the root to $S_j$ times $\sqrt{T \log(1/Z)}$, and the loss is given by $Z\sqrt{T}$ times the level of $S_j$ in $\mathcal{T}$. ■

We now bound regret against a combination of strategies from the set $\{S_1, \ldots, S_N\}$. We will use the following modified version of Algorithm 6:

Algorithm 6: UPDATE$(x_t, b_t)$

1: if $|x_t| < U \lor g(x_t) b_t < 0$ then
2:     return $\rho(b_t) x_t + b_t$
3: else
4:     return $\rho(b_t) x_t$
5: end if

It is easy to see that the same regret and loss bounds hold for Algorithm 6.
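The modified update rule translates directly into code. The sketch below is illustrative only: the betting function, the discounting rule `rho` and the threshold `U` are passed in by the caller, and the concrete choices used for the quick check (`sign_bet`, `constant_rho`, `U = 10`) are hypothetical placeholders, not the paper's parameters.

```python
def update(x, b, rho, g, U):
    """Modified update: stop accumulating b once the bet is saturated and correct."""
    if abs(x) < U or g(x) * b < 0:
        return rho(b) * x + b      # line 2: ordinary discounted update
    return rho(b) * x              # line 4: discount only, b is not added


# Hypothetical stand-ins for g, rho and U, used only to exercise both branches.
U = 10.0
sign_bet = lambda x: max(-1.0, min(1.0, x / U))
constant_rho = lambda b: 0.9

below_threshold = update(5.0, 1.0, constant_rho, sign_bet, U)    # |x| < U
saturated_right = update(20.0, 1.0, constant_rho, sign_bet, U)   # bet 1, b > 0
saturated_wrong = update(20.0, -1.0, constant_rho, sign_bet, U)  # bet 1, b < 0
```

Only the second case takes the discount-only branch: the bet is already at full confidence and the payoff agrees with it, so the full payoff is collected without pushing the deviation further.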

Lemma 24 Discounted loss of Algorithm 6 is at most $ZT(1 - \rho)^{1/2}$ and the regret is at most $T\sqrt{\log(1/Z)/n} + \sqrt{n \log(1/Z)}$, where $\rho = 1 - 1/n$.



Proof: Let

$$b^\circ_t = \begin{cases} b_t, & \text{if line 2 of Algorithm 6 is executed at time } t \\ 0, & \text{o.w.} \end{cases}$$

Let $b^*_t := b_t - b^\circ_t$. Note that running Algorithm 6 on $b$ is equivalent to running Algorithm 1 on $b^\circ$ and additionally getting all payoff from $b^*$. Thus, all loss and regret bounds of Algorithm 1 apply. Moreover, the discounted loss and regret bounds apply with any $\eta \ge \rho$, since the integral condition of Lemma 14a holds for all $z \le U + 1$ if $n \ge 40\log(1/Z)$.
Note that the regret with respect to a strategy $S_i$ incurred by Algorithm 5 is proportional to the number of right transitions on the way to $S_i$ from the root. Hence, in what follows we will be using the linear combination tree, where each $S_i$ is located at depth $i$, and there is exactly one right transition on the way from the root to $S_i$ for each $i$. This is equivalent to starting with strategy $S_N$ and consecutively improving it with $S_j$, $j = N - 1, \ldots, 1$. Thus the loss incurred wrt $S_j$ is $O(jZTn^{-1/2})$, where $n$ is the window size, and the regret wrt $S_j$ is $O\left((T/n^{1/2} + n^{1/2})\sqrt{\log(1/Z)}\right)$.

Theorem 25 Consider the result of running Algorithm 5 on a sequence $b_t$, using a set of strategies $\{S_1, \ldots, S_N\}$. Let $I_1, \ldots, I_k$, $I_j = [A_j, B_j] \subset [1..T]$ be a covering of $[1..T]$ by disjoint intervals. Then for any assignment of strategies to intervals $\eta_j$, $1 \le j \le k$, one has after setting $n = T/k$

gain > maJ^^) ~ \Ariog(l/Z),CjJ - O(ZNT). 

Proof: 

Using Lemma [23] we get that 

k B 3 

gam>EE s V-°( zivr )- 

3=1 t=Aj 

Fix j and denote r = s^^—i. 

E ^ = a - p) E E + a - p) E ^ Aj ' +1 - 

>(i-p) E E ^ - r = E s %-.*'d - ^ J-m ) - r w 

t'=Aj 1=0 t>=A i 

^ E - 3[/ 



since by assumption $|r| \le U$, where $U = 4\sqrt{n \ln(1/Z)}$.

Since every comparison that uses a strategy as a base incurs loss at most $ZT$, we have

k Bj k Bj 

gain > E E Sj(t) -O(ZNT) > E E s Vj (t) - 2k v / nlog{l/Z) - 8Ty / ln(l/Z)/n - O(ZNT). 
3=1 t=Aj j=l t=Aj 

Set $n = T/k$. Then regret is bounded by

$$2k\sqrt{(T/k) \log(1/Z)} + 8T\sqrt{k \ln(1/Z)/T} \le O\left( \sqrt{kT \log(1/Z)} \right)$$

as required. ■
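The final substitution can be checked with a few lines of arithmetic. The sketch below evaluates the two regret terms from the proof for an arbitrary (hypothetical) choice of $T$, $k$ and $Z$, and confirms that with $n = T/k$ they add up to $10\sqrt{kT\log(1/Z)}$, i.e. both terms are of the same order.

```python
import math


def regret_bound(n, k, T, Z):
    """Two regret terms from the proof: 2k*sqrt(n log(1/Z)) + 8T*sqrt(log(1/Z)/n)."""
    log_term = math.log(1.0 / Z)
    return 2 * k * math.sqrt(n * log_term) + 8 * T * math.sqrt(log_term / n)


T, k, Z = 10_000, 4, 0.01          # hypothetical horizon, interval count, loss parameter
n = T / k                          # window size chosen in the proof
balanced = regret_bound(n, k, T, Z)
closed_form = 10 * math.sqrt(k * T * math.log(1.0 / Z))
```

A window that is much smaller or much larger than $T/k$ inflates one of the two terms, which is why the bound requires knowing $k$ in advance.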

Theorem 25 shows that sublinear regret can be achieved if the number of intervals $k$, and hence the window size $n = T/k$, is known. The necessity of knowing the window size limits the use of such an approach when the input sequence is inhomogeneous, i.e. exhibits short-term as well as long-term phenomena. We now give an algorithm that adapts to the input without knowing the sizes of intervals in advance.

Define the linear comparison tree $\mathcal{T}_c$ as the tree that has height $O(N \log T)$ and in which strategy $S_j$ is applied with window size $2^k$, $k = 1, \ldots, \log T$, at level $N(k - 1) + j$. There is exactly one right transition on the way from the root to any $S_j$.

Proof of Theorem S As in the proof of Theorem |25l we need to relate © to the sum of payoffs of S rij on 
Ij. Consider the application of strategy S„. with window size 2 r J , where 2 r i < \L\ log(l/Z) < 2 r -? +1 . Then 



one has as before 



k B 3 

§ ain >EE S %>* O(ZNTlogT), 
3=1 t=A 3 



where U = y/Ti ln(l/Z). 
Thus, we have that 

k Bj k 

3=1 t=Aj j=l 
k 

E 

J=i 

by the choice of rj. 



B, 



t=A 



> max < 



]T s vj (t) - log(l/Z) - 0(\Ij\^/2-^ ln(l/Z)) - 0(ZNT logT) 
I -O(ZNTlogT) 



S m {Ij)-0\J\Ij\\og{l/Z) 



5 Uniformity of the payoff sequence

In this section we study the role of the discounting parameter $\rho$ in Algorithm 1 and prove that the algorithm takes advantage of any significant deviation of the sequence of payoffs from random in a window of any size. Recall that by Definition 9 the smoothed payoff of a strategy $S$ is

$$\bar s^{\rho}_t = (1 - \rho) \sum_{j=1}^{t} \rho^{t-j} s_j.$$

We first prove a lemma that relates a sequence smoothed with parameter $0 < \rho_1 < 1$ to the same sequence smoothed with any $\rho_2 < \rho_1$. We have

Lemma 26 For any $\rho_1 > \rho_2$ the $\rho_1$-smoothed payoff at time $t$ is a convex combination of $\rho_2$-smoothed payoffs at time $j \le t$:

$$\bar s^{\rho_1}_t = \frac{1 - \rho_1}{1 - \rho_2} \left[ \bar s^{\rho_2}_t + \sum_{j < t} \bar s^{\rho_2}_j\, \rho_1^{t-j-1} (\rho_1 - \rho_2) \right].$$



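Proof: We verify that the coefficients of $s_{t-j}$ in lhs and rhs coincide. The coefficient of $s_{t-j}$ in lhs is $(1 - \rho_1)\rho_1^j$. The coefficient in rhs is

$$\begin{aligned}
(1 - \rho_1) \left[ \rho_2^j + \sum_{t'=t-j}^{t-1} \rho_2^{t'-t+j} \rho_1^{t-t'-1} (\rho_1 - \rho_2) \right]
&= (1 - \rho_1) \left[ \rho_2^j + (\rho_1 - \rho_2) \sum_{k=0}^{j-1} \rho_2^{k} \rho_1^{j-1-k} \right] \\
&= (1 - \rho_1) \rho_1^j.
\end{aligned}$$

The coefficients in the rhs sum up to

$$\frac{1 - \rho_1}{1 - \rho_2} \left[ 1 + \sum_{j \ge 0} \rho_1^j (\rho_1 - \rho_2) \right] = \frac{1 - \rho_1}{1 - \rho_2} \left[ 1 + \frac{\rho_1 - \rho_2}{1 - \rho_1} \right] = 1.$$

■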
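The decomposition in Lemma 26 can be verified numerically. The sketch below smooths an arbitrary (hypothetical) payoff sequence at two scales and rebuilds the coarser smoothed value at the last time step from the finer ones, using the weights $\frac{1-\rho_1}{1-\rho_2}$ for the current step and $\frac{1-\rho_1}{1-\rho_2}\rho_1^{t-j-1}(\rho_1-\rho_2)$ for each earlier step $j$.

```python
import random


def smoothed(s, rho):
    """rho-smoothed payoffs: (1 - rho) * sum_{j <= t} rho^(t-j) * s_j for each t."""
    out, acc = [], 0.0
    for s_t in s:
        acc = rho * acc + s_t
        out.append((1 - rho) * acc)
    return out


random.seed(1)
s = [random.uniform(-1, 1) for _ in range(40)]
rho1, rho2 = 0.95, 0.8                 # rho1 > rho2: longer vs shorter window
coarse = smoothed(s, rho1)
fine = smoothed(s, rho2)

t = len(s) - 1                         # check the identity at the last time step
scale = (1 - rho1) / (1 - rho2)
# Weight of fine[j] for j < t, followed by the weight of fine[t].
weights = [scale * rho1 ** (t - j - 1) * (rho1 - rho2) for j in range(t)] + [scale]
rebuilt = sum(w * f for w, f in zip(weights, fine))
```

For a finite history the weights are nonnegative and sum to slightly less than 1, approaching 1 as the history grows, so a bound on the finer-scale smoothed payoffs carries over to the coarser scale.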



We now give the desired characterization of the output sequence.

Let $S_1, \ldots, S_N$ be a set of strategies and let $\mathcal{T}(S)$ be the linear comparison tree of $S$. The payoff of $S_i$ at time $t$ is denoted by $s_{i,t}$. Recall that the tree contains $N \log T$ levels corresponding to applications of strategies $S_j$ with geometrically decreasing window size. Levels are indexed by pairs $(i, j)$: strategy $S_i$ is applied with window length $2^j$ at level $(i, j)$. Denote by $s^*_{i,j,t}$ the payoff of the strategy obtained at level $(i, j)$ at time $t$. Let $s_{*,t} = s^*_{N, \log T, t}$. We give the proof of

Theorem 12 The sequences $(s_{i,t} - s_{*,t})$ are $Z$-uniform for any $1 \le i \le N$ at any scale $\rho \ge 1 - 1/(40\log(1/Z))$ as long as $Z = O((NT)^{-2})$.

Proof: 

We start by showing that (s^ — s*^) are Z-uniform at all scales n = 2 J ; , 1 < j < log T, n > 40 log(l/Z), 
corresponding to discount factor p = 1 — 1/n. Consider the application of s^t at windows size n = 2 jf , denote 
the payoff of the base strategy for s^t by sqj and denote the coefficient in the convex combination by g t , so 
that Sj ^t = so,t + 9t{si,t — So,*)- Then one has by Lemma [14a 



J2p t j (s ,t+gt(s 

j=0 



S0,t) ~ 8i,t) 



J=0 



(10) 



< x t - $ t + 2t/Vn < 0(Vnlog(l/Z)) + 0(Zt/Vn), 



where we used the fact that Yl^oP 



SLt 



so,t) is exactly the discounted deviation x t at time t, and 



£}=o - > *t - 0{Zt/y/n) by LemmaER. 

We have shown that the sequence (sf j $ — s^t) is Z-uniform at scale 2 J after the application of Si t at level 
j, and it remains to show that this property is not destroyed by the subsequent combinations in T(S). By 
Lemma [14a one has that for any t 



zZp^i^t ~ s*,t) < -O(ZiVTlogT), (11) 

3=0 

and hence by combining (flOl and (fTTT > we get 

t 

j^'-'fot - s*, t ) < 0(v/nlog(l/Z)) - 0(ZiVT log T). 
i=o 

We now show that the sequence (s^ — s*^) is Z-uniform at any scale. Consider a value of p ^ 1 — 1/2 J . 
Let / > be such that pi/2 < p < p\. Set ni = (1 — p/)~ 1 .By Lemma [261 one has for any sequence 6 



6" < 



1 



b P t l +Y, h t- 3 Pi 3 '\p-Pi) 

3<t 



where the coefficients in the rhs are non-negative and sum up to 1. Thus, setting $b = s_{i,t} - s_{*,t}$, we get the desired conclusion for all $\rho \ge 1 - 1/(40\log(1/Z))$. Thus, the discounted deviation is $O(\sqrt{n \log(1/Z)})$ as long as $Z = O((NT)^{-2})$. ■

Corollary 27 Suppose that we are given a set of strategies $S_1, \ldots, S_N$. Then by alternating $S_i$ with the null strategy in the comparison tree we can ensure that the final sequence of payoffs is nonnegative in every window and is $Z$-uniform wrt each $S_i$ at any scale $\rho \ge 1 - 1/(40\log(1/Z))$ as long as $Z = O((NT)^{-2})$.
6 Risk-free assets, transaction costs and high probability bounds



The results on predicting a bit sequence without loss given above yield the following construction of risk-free 
assets. Let vt be the expected price of a stock at time t and let xt = log(ut) — log(vi). Let the expected 
rate of return of the stock be r, i.e. E[xt] = rT. We make an important simplifying assumption that the 
percentage change in the price of the stock is bounded, for example $|x_t| \le 1$. Running our algorithm on the sequence $x_t$ produces a sequence of confidence values $g_t$, $t = 1, \ldots, T$, which are interpreted as a signal to buy if $g_t > 0$ and sell otherwise, where $|g_t|$ specifies the amount of stock to buy/sell. The bounded loss property now implies that this investment strategy does not lose more than $Z\sqrt{T}$ of the initial capital at any time $t = 1, \ldots, T$. The regret property means that the rate of return of the investment strategy is at least

$$r - \sqrt{\log(1/Z)/T} - Z/\sqrt{T}.$$

This construction assumes zero transaction cost and the ability to trade fractional amounts of shares. However, transaction costs may make these assumptions unrealistic. This motivates introducing randomness into the process and interpreting $|g_t|$ as the probability of buying/selling or doing nothing rather than the actual amount that the strategy buys/sells at each point in time. The guarantees on expected regret carry over immediately, but it becomes desirable to have a high probability bound on the regret and loss in this setting. We show that our algorithm has bounded loss and good regret with high probability.

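Consider the function $g(x)$ defined as follows:

$$g(x) = \begin{cases} Z (x/\epsilon T), & x < \epsilon T \\ Z e^{(x - \epsilon T)^2/(16T)}, & \epsilon T \le x \le \epsilon T + 4\sqrt{T \log(1/Z)} \\ 1, & \text{o.w.} \end{cases} \tag{12}$$

and

$$h(x) = \begin{cases} 1, & x < \epsilon T \\ 1/2, & \epsilon T \le x \le \epsilon T + 4\sqrt{T \log(1/Z)} \\ 0, & \text{o.w.} \end{cases}$$

It is easy to see that

$$\max_{s' \in [\rho x - 1,\, \rho x + 1]} g'(s') (s' - x)^2/2 \le \bar\rho x g(x) h(x) + Z.$$

We get regret at most $2\epsilon T$ by the same arguments as in Theorem 1. We now prove the high probability bound on the loss given in Theorem 29.
We prove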

Theorem 28 Let $0 < c < 1$ be a transaction cost. There exists a randomized strategy that yields exponentially small loss and expected regret at most $(1 - 4c)T$ in the presence of transaction cost of $c$ per trade.

Proof: Set $\epsilon = 2c$ in (12). Note that the expected transaction costs incurred are $\sum_{t=1}^{T} c\, g(x_t)$. We have

$$\begin{aligned}
\mathrm{gain} &\ge \sum_{t=1}^{T} \bar\rho x_t g(x_t)(1 - h(x_t)) - \sum_{t=1}^{T} c\, g(x_t) - 2cZT \\
&\ge \sum_{t:\, 2cT \le x_t \le 3cT} \left[ \bar\rho x_t g(x_t)/2 - c\, g(x_t) \right] + \sum_{t:\, x_t > 3cT} \left[ \bar\rho x_t g(x_t) - c\, g(x_t) \right] - 3cZT \\
&\ge \sum_{t:\, x_t > 3cT} \left[ \bar\rho x_t g(x_t) - c\, g(x_t) \right] - 3cZT.
\end{aligned}$$

Thus, the gain is at least $-3cZT$ even after discounting transaction costs, and the regret is at most $4cT$ by the same argument as in Theorem 1, which we do not repeat here. ■

We also show that we can get essentially zero loss in expectation in the presence of transaction costs:

Theorem 29 The loss of Algorithm 1 achieving regret $\epsilon T$ is $O(\log(1/\delta)/\epsilon)$ for any $1 \le t \le T$ with probability at least $1 - \delta$ for any $\delta > 0$. In particular, by setting $\delta = \epsilon$ and terminating the algorithm if the loss is larger than $O(\log(1/\epsilon)/\epsilon)$, we get an algorithm with regret at most $4\epsilon T$ and loss $O(\log(1/\epsilon)/\epsilon)$.

Proof: Let x% be the sequence of discounted deviations, and let Xj, 1 < j < T be the ±1 random variables 
corresponding to the bets that the algorithm makes. We need to show that 



Define 



We also have 



Pr 



-21og(l/<f)/e 



< 5. 



t 



E 



Since the probability of making a bet when pxj < e is at most Z < 1/e, we have that the payoff cannot be 
smaller than — log(l/<5) with probability larger than 1 — 5. Otherwise, when px > e, we have of < fi t /e . 
We have by Bernstein's inequality 



Pr 



^X,<-log(l/J)/ e 



< exp 



i2 i 



Since 



{pi t + \og{l/5)/ef 



> 



{pit + log(l/g)/e 
(T t 2 +log(l/(5)/3e 

(/Z i+ l0g(l/5)/6) 2 



of + log(l/<5)/3e " fH/e + Iog(l/5)/3e 



> log(l/<5), 



we get the desired result. The last inequality can be verified by considering two cases: p, > log(l/<5)/3 and 
p < log(l/(5)/3. ' ' ■ 

Theorem 29 is optimal up to constant factors:

Theorem 30 Any algorithm achieving regret $\epsilon T$ with $\pm 1$ betting amounts incurs loss $\Omega(\log(1/\delta)/\epsilon)$ with probability at least $1 - \delta$, for any $\delta > 0$.

Proof: Let bt be iid Ber(±l, ^4£) and denote the confidence of the algorithm at time t by gt. Let T be the 

(random) maximum time such that J2j=i 9j < l/ e2 - Thus, J2j=i 9j — ^/ e2 ~ 1- 
One has 

T 

E b E a i g ^2 9th < 1/e, 

where E& denotes expectation with respect to b and E a i g denotes expectation wrt the randomness of the 

algorithm. Thus, there exists an input sequence b* for which E a i g YlJ=i 9tK — l/ e - I n particular, with 

probability at least 1/2 one has Ylj=i 9t^t — V e - 

Thus, we have that on the sequence b* the expected return of the algorithm is at most 2/e with probability 
at least 1/2 (over the coin flips of the algorithm). The payoff of the algorithm is then a sum of Bernoulli 
variables with expectation at most 2/e and variance X^J=i 9t ^ (V e2 — !)■ Thus, at time T the loss is as 
large as r 2(log(l/5)/e) with probability at least 5/2 for any 5 > 0. ■ 
7 Applications

In this section we show applications of our framework to two problems in online learning: the adversarial multi-armed bandit problem in the partial information model and online optimization.

7.1 Partial information model

We show that a simple application of our framework can be used to obtain an algorithm with sublinear regret and essentially zero expected loss with respect to the average of all arms. In particular, for any $Z < T^{-2}$ we will obtain an algorithm with regret $O(N^{1/3} T^{2/3} (\log(1/Z))^{1/3})$ and expected loss at most $ZT$ with respect to the average of all arms.

Let the rewards of $N$ arms be given by $x_i(t)$, $1 \le i \le N$, $1 \le t \le T$. We will define probabilities $p_i(t)$ of sampling arms $i$ at time $t$ inductively. Let $\gamma > 0$ be a parameter to be fixed later. Let $I_t$ be the (random) arm played at time $t$. Define

$$\hat x_i(t) = \begin{cases} x_i(t)/p_i(t), & \text{if } I_t = i \\ 0, & \text{o.w.} \end{cases}$$

Define $p_i(t)$ as follows:

$t = 1$: $p_i(t) = 1/N$ for all $i = 1, \ldots, N$.

$t \to t + 1$: Consider the sequence of payoffs $\hat x_i(t)$ as a full information problem (thus, at each time step $t$ all $\hat x_i(t)$ except for possibly the one that was played are zero). Note that $\hat x_i(t) \le N/\gamma$ since $p_i(t) \ge \gamma/N$, $\forall t, i$. Run Algorithm 5 on this sequence after scaling it down by a factor of $N/\gamma$ using $\rho(b_t) = 1 - |b_t|/n$. Let $r_i(t + 1)$ be the probability of playing arm $i$ at time $t + 1$ given by Algorithm 5. Let

$$p_i(t + 1) = (1 - \gamma) r_i(t + 1) + \gamma/N.$$
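The two ingredients of this reduction, the importance-weighted reward estimate and the mixing with uniform exploration, can be sketched as follows. This is an illustration only: `estimate` and `mix` are hypothetical helper names, and the probability vector `r` stands in for the output of the full-information algorithm, which is not implemented here.

```python
def estimate(reward, played, probs):
    """Importance-weighted reward vector: reward / p for the played arm, 0 elsewhere."""
    return [reward / probs[i] if i == played else 0.0 for i in range(len(probs))]


def mix(base_probs, gamma):
    """Blend the full-information distribution with uniform exploration."""
    n_arms = len(base_probs)
    return [(1 - gamma) * r + gamma / n_arms for r in base_probs]


gamma = 0.1
r = [0.7, 0.2, 0.1, 0.0]        # hypothetical output of the full-information algorithm
p = mix(r, gamma)               # sampling distribution actually used
x = [0.5, -1.0, 1.0, 0.25]      # hypothetical true rewards at this step

# Expectation of the estimate over the random choice of the played arm.
expected = [
    sum(p[arm] * estimate(x[arm], arm, p)[i] for arm in range(len(p)))
    for i in range(len(p))
]
largest = max(abs(v) for arm in range(len(p)) for v in estimate(x[arm], arm, p))
```

The mixing step guarantees every arm is played with probability at least $\gamma/N$, which keeps the estimates bounded by $N/\gamma$, while the division by $p_i(t)$ makes the estimate unbiased.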

We have 

Lemma 31 For any $Z < (NT)^{-2}$ the expected regret of the algorithm is at most $O(N^{1/3} T^{2/3} (\log(1/Z))^{1/3})$ and the expected loss is $O(ZNT)$.

Proof: First consider the auxiliary full information problem. By Theorem 22 with $p = 1$ the regret is at most

$$O\left( U + U \sum_{t=1}^{T} \bar\rho_t \right) = O\left( U + U \sum_{t=1}^{T} (\gamma/N)\, \hat x_i(t)/n \right).$$

Set $n = T\gamma/N$, $U = \sqrt{(T\gamma/N) \log(1/Z)}$. Then the expected regret is

$$\mathbf{E}\left[ O\left( U + U \sum_{t=1}^{T} (\gamma/N)\, \hat x_i(t)/n \right) \right] = O\left( \sqrt{(T\gamma/N) \log(1/Z)} \right),$$

where we used the fact that $\mathbf{E}[\hat x_i(t)] = x_i(t)$. Thus, the final expected regret is at most

$$O\left( \gamma T + (N/\gamma) \sqrt{T (\gamma/N) \log(1/Z)} \right) = O\left( \gamma T + \sqrt{NT \log(1/Z)/\gamma} \right),$$

where $\gamma T$ comes from the fact that the algorithm pulls a uniformly random arm with probability $\gamma$. Setting $\gamma^{3/2} = \sqrt{N \log(1/Z)/T}$, we get

$$O\left( N^{1/3} T^{2/3} (\log(1/Z))^{1/3} \right).$$

The expected loss with respect to the average of all arms is

$$O((N/\gamma)(ZT/U)) = O(ZNT).$$

■
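The choice of $\gamma$ in the proof balances the exploration cost $\gamma T$ against the learning cost $\sqrt{NT\log(1/Z)/\gamma}$. The short computation below, with hypothetical values of $N$, $T$ and $Z$, checks that the stated choice equalises the two terms and that each equals $N^{1/3}T^{2/3}(\log(1/Z))^{1/3}$.

```python
import math


def tuned_gamma(N, T, Z):
    """Exploration rate solving gamma^(3/2) = sqrt(N log(1/Z) / T)."""
    return (N * math.log(1.0 / Z) / T) ** (1.0 / 3.0)


N, T = 10, 1_000_000               # hypothetical number of arms and horizon
Z = 0.5 / (N * T) ** 2             # satisfies Z < (NT)^(-2)
gamma = tuned_gamma(N, T, Z)

exploration_cost = gamma * T
learning_cost = math.sqrt(N * T * math.log(1.0 / Z) / gamma)
target = N ** (1 / 3) * T ** (2 / 3) * math.log(1.0 / Z) ** (1 / 3)
```

Any other $\gamma$ makes one of the two terms larger, so this choice gives the best bound available from this argument up to constants.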

It is interesting to note that unlike the full information model, one cannot achieve essentially zero loss 
with respect to an arbitrary strategy in the partial information model. 

7.2 Online optimization 

We first note that our techniques yield algorithms in the online decision making framework of [21] that have optimal regret with respect to dynamic strategies. We do not state the guarantees here since the exposition in [21] is quite similar to the experts problem. One interesting consequence of our analysis that should be noted is as follows. Let $x_t$, $t = 1, \ldots, T$ be an adversarial real-valued sequence, $x_t \in [-1, 1]$, presented to the algorithm in an online fashion. Then a straightforward application of Theorem 12 implies that one can approximate the signal $x_t$ by $\hat x_t$ so that the cumulative deviation of $\hat x_t$ from $x_t$ in any window of size $n = \Omega(\log T)$ is not greater than $O(\sqrt{n \log T})$, i.e. the deviation that one would expect to see with probability $1 - T^{-\Theta(1)}$ if the difference were uniformly random.

We now show how our framework can be applied to online convex optimization methods of [28]. We start by defining the problem. Suppose that the algorithm is presented with a sequence of convex functions $c_t : F \subset \mathbb{R}^n \to \mathbb{R}$, $t = 1, \ldots, T$. Denote the decision of the algorithm at time $t$ by $x_t$. The objective is to minimize regret against the best single decision in hindsight:

$$\sum_{t=1}^{T} c_t(x_t) - \min_{x \in F} \sum_{t=1}^{T} c_t(x).$$


If the functions $c_t$ are convex, gradient descent methods can be used in the online setting [28] to get efficient algorithms. We state the greedy projection algorithm here for convenience of the reader:

Algorithm 7: Greedy projection algorithm ([28])

Select $x_1 \in F$ arbitrarily, choose a sequence of learning rates $\eta_t$, $t = 1, \ldots, T$
for t = 1 to T do
    Set $x_{t+1} \leftarrow P(x_t - \eta_t \nabla c_t(x_t))$.
end for

Here $P$ is the orthogonal projection operator onto $F$. In what follows we use $\|F\|$ to denote (an upper bound on) the diameter of $F$, and $\|\nabla c\|$ to denote an upper bound on the norm of the gradient of $c_t$ on $F$. One has

Theorem 32 ([28]) The greedy projection algorithm with $\eta_t = t^{-1/2}$ has regret at most $\|F\|^2 \sqrt{T}/2 + (\sqrt{T} - 1/2)\|\nabla c\|^2$.
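As a concrete illustration of greedy projection, the sketch below runs it in one dimension on the interval $F = [-1, 1]$ with quadratic costs $c_t(x) = (x - z_t)^2$. The cost family, the target sequence and the function name are assumptions made for the example; in this setting projection is just clipping, $\|F\| = 2$ and $|c_t'(x)| \le 4$ on $F$.

```python
import math


def greedy_projection(targets, lo=-1.0, hi=1.0):
    """Online gradient descent on c_t(x) = (x - z_t)^2 over F = [lo, hi]."""
    x = lo                                    # arbitrary starting point in F
    total_cost = 0.0
    for t, z in enumerate(targets, start=1):
        total_cost += (x - z) ** 2            # pay the cost of the current decision
        grad = 2.0 * (x - z)
        eta = 1.0 / math.sqrt(t)              # learning rate eta_t = t^(-1/2)
        x = min(hi, max(lo, x - eta * grad))  # projection onto the interval
    return total_cost


T = 400
targets = [1.0 if t % 3 else -1.0 for t in range(T)]   # hypothetical cost sequence
algorithm_cost = greedy_projection(targets)

best_fixed = sum(targets) / T        # minimiser of the total quadratic cost, lies in F
best_cost = sum((best_fixed - z) ** 2 for z in targets)
regret = algorithm_cost - best_cost

diameter, grad_bound = 2.0, 4.0      # ||F|| and a bound on |c_t'(x)| over F
bound = diameter ** 2 * math.sqrt(T) / 2 + (math.sqrt(T) - 0.5) * grad_bound ** 2
```

The measured regret against the best fixed point stays below the bound of Theorem 32, which grows only as $\sqrt{T}$.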

The following notion introduced in [28] parameterizes dynamic strategies in the online gradient descent setting:

Definition 33 ([28]) The path length of a sequence $x_1, \ldots, x_T$ is

$$\sum_{t=1}^{T-1} \|x_{t+1} - x_t\|.$$

Define $A(T, L)$ to be the set of sequences with $T$ vectors and path length less than $L$.
Definition 34 ([28]) Given an algorithm $A$ and a maximum path length $L$, the dynamic regret $R_A(T, L)$ is

$$R_A(T, L) = C_A(T) - \min_{A' \in A(T, L)} C_{A'}(T).$$

Zinkevich ([28]) shows that

Theorem 35 ([28]) If $\eta$ is fixed, the dynamic regret of the greedy projection algorithm is

$$R_G(T, L) \le \frac{7\|F\|^2}{4\eta} + \frac{L\|F\|}{\eta} + \frac{T\eta\|\nabla c\|^2}{2}.$$

Black-box application of techniques of [28] requires setting the learning rate $\eta$ to the value given by the path length that one would like to be competitive against. It would be desirable to devise an algorithm that is simultaneously competitive against all possible path lengths. Choose $\eta_j = 2^{-j}$, $j = 1, \ldots, \log T$, $\rho_i = 1 - 2^{-i}$, $i = 1, \ldots, \log T$. Let $S_{i,j}$ be the strategy that applies the gradient descent algorithm with $\eta = \eta_j$, $\rho = \rho_i$. We then have

Theorem 36 Choose any Z < 1/e and any partition of[l : T] into disjoint intervals Ij,j = 1, . . . , k. Let the 



desired path length for Ij be jj\\F\\\Ij\. Then the regret of the tree-based comparison algorithm is at most 
A Proofs of Lemma 14a and Lemma 15a

Lemma 14a Suppose that the function $g(x)$ used in Algorithm 2 satisfies

$$\max_{s' \in [\rho x - |b_t|,\, \rho x + |b_t|]} |g'(s')| (s' - x)^2/2 \le \bar\rho_t x g(x) h(x) + Z'$$

for a function $h(x)$, $0 \le h(x) \le 1$, $\forall x$, $Z' > 0$, and all $0 \le \Delta \le 1$. Also, suppose that

$$\bar\eta \left| \int_0^z g(s)\,ds \right| \le \bar\rho z |g(z)|/2$$

for any $\eta \ge \eta^*$. Then

1. the payoff of the algorithm satisfies

$$\sum_{j=1}^{t} g(x_j) b_j \ge \sum_{j=1}^{t} \bar\rho_j x_j g(x_j)(1 - h(x_j)) + \Phi_t - Z't$$

2. if $\rho_t = \rho$, then at each time step $1 \le t \le T$ the $\eta$-smoothed payoff of the algorithm satisfies

$$\sum_{j=1}^{t} \eta^{t-j} g(x_j) b_j \ge \Phi_t - Z'/(1 - \rho)$$

as long as $|b_t| \le 1$ for all $t$.

Proof: We will show that at each t 

$ t+ i - r)$ t < b t g(x t ) + Z' - p t x t g(x t ){l - h(x t )). 

Thus, (1) will follow by noting that rj < 1 and summing over t from 1 to T, and (2) will follow from 

t 

3=1 

t t 

<Y. r l~ j b 3 9{x j ) + Z'/(l - p) -^p^pjxjgix^l - h( Xj )), 

3=1 3=1 

and the fact that $i = 0. 

We consider the case $x_t > 0$. The case $x_t < 0$ is analogous. In the following derivation we will write
$[A, B]$ to denote $[\min\{A, B\}, \max\{A, B\}]$.

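$0 \le b_t \le 1$: We have $x_{t+1} = \rho_t x_t + b_t = x_t - \bar\rho_t x_t + b_t$ and the expected gain of the algorithm is $g(x_t) b_t$.
We have

$$\begin{aligned}
\Phi_{t+1} - \eta\Phi_t &= \int_{x_t}^{x_t - \bar\rho_t x_t + b_t} g(s)\,ds + (1 - \eta)\Phi_t \\
&\le g(x_t)(-\bar\rho_t x_t + b_t) + (1 - \eta)\Phi_t + \max_{s' \in [x_t,\, x_t - \bar\rho_t x_t + b_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t)(-\bar\rho_t x_t + b_t) + \bar\rho_t x_t g(x_t)/2 + \max_{s' \in [x_t,\, x_t - \bar\rho_t x_t + b_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t) b_t - g(x_t) \bar\rho_t x_t / 2 + \max_{s' \in [x_t,\, x_t - \bar\rho_t x_t + b_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t) b_t + (1/2)(-1 + h(x_t)) \bar\rho_t x_t g(x_t) + Z'
\end{aligned}$$

as required.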
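The derivation above bounds the integral of $g$ over the step from $x_t$ to $x_{t+1}$ by a first-order Taylor estimate. A minimal numerical sketch of the standard form of that estimate, $\int_x^{x+\delta} g(s)\,ds \le g(x)\delta + \max_{s'} |g'(s')| \delta^2/2$, is given below. The stand-in function $g(s) = \sinh(s)$ and the helper name `taylor_gap` are assumptions chosen for convenience; this is not the function $g$ used by Algorithm 1.

```python
import math

def taylor_gap(g, g_prime, G, x, delta, grid=1000):
    """Return (lhs, rhs): the exact integral of g over [x, x + delta]
    and the first-order Taylor upper bound on it.

    G must be an antiderivative of g. The maximum of |g'| is taken on a
    uniform grid over the interval, endpoints included.
    """
    lo, hi = min(x, x + delta), max(x, x + delta)
    lhs = G(x + delta) - G(x)
    max_slope = max(abs(g_prime(lo + (hi - lo) * k / grid)) for k in range(grid + 1))
    rhs = g(x) * delta + max_slope * delta ** 2 / 2
    return lhs, rhs

# Stand-in g(s) = sinh(s): antiderivative cosh(s), derivative cosh(s).
forward = taylor_gap(math.sinh, math.cosh, math.cosh, x=2.0, delta=0.7)
backward = taylor_gap(math.sinh, math.cosh, math.cosh, x=2.0, delta=-0.9)
```

Both a positive and a negative step are included because the proof treats the two signs of $b_t$ separately.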
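$-1 \le b_t \le 0$: We have $x_{t+1} = \rho_t x_t + b_t = x_t - \bar\rho_t x_t + b_t$, and the expected gain of the algorithm is $g(x_t) b_t$.
We have

$$\begin{aligned}
\Phi_{t+1} - \eta\Phi_t &= \int_{x_t}^{x_t - \bar\rho_t x_t + b_t} g(s)\,ds + (1 - \eta)\Phi_t \\
&\le g(x_t)(-\bar\rho_t x_t + b_t) + (1 - \eta)\Phi_t + \max_{s' \in [x_t - \bar\rho_t x_t + b_t,\, x_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t)(-\bar\rho_t x_t + b_t) + \bar\rho_t x_t g(x_t)/2 + \max_{s' \in [x_t - \bar\rho_t x_t + b_t,\, x_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t) b_t - g(x_t) \bar\rho_t x_t / 2 + \max_{s' \in [x_t - \bar\rho_t x_t + b_t,\, x_t]} |g'(s')| (s' - x_t)^2 / 2 \\
&\le g(x_t) b_t + (1/2)(-1 + h(x_t)) \bar\rho_t x_t g(x_t) + Z'.
\end{aligned}$$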
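The proof relies on the recursion $x_{t+1} = \rho_t x_t + b_t = x_t - \bar\rho_t x_t + b_t$ with $\bar\rho_t = 1 - \rho_t$. The sketch below iterates it for a constant $\rho$ and illustrates that $|x_t| \le 1/(1 - \rho)$ whenever $|b_t| \le 1$, which follows from the geometric series. The starting value $x_1 = 0$, the constant discount factor, the random $\pm 1$ inputs and the name `run_deviation` are assumptions made for the example.

```python
import random

def run_deviation(bits, rho):
    """Iterate x_{t+1} = rho * x_t + b_t from x_1 = 0 and return the trajectory."""
    xs = [0.0]
    for b in bits:
        xs.append(rho * xs[-1] + b)
    return xs

random.seed(0)
rho = 0.9
bits = [random.choice([-1, 1]) for _ in range(500)]
xs = run_deviation(bits, rho)

# Largest deviation reached, to compare against the bound 1 / (1 - rho).
largest = max(abs(x) for x in xs)
```

A smaller $\bar\rho$ gives a longer effective window and allows a larger deviation, which is why the lemma is stated for a general $\rho_t$.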



The following lemma shows that the function $g(x)$ defined above satisfies all required properties:

Lemma 15 Let $L > 0$ be such that $\bar\rho \ge A^2/n$, $1/n \ge 1/L^2$. Then for $n \ge 40\log(1/Z)$ the function
$g(x)$ defined above satisfies

$$\max_{s \in [\rho x - A,\, \rho x + A]} |g'(s)| (s - x)^2 \le \bar\rho x g(x) h(x) + Z,$$

where $h(x)$ is the step function defined above.

Proof: The intuition behind the lemma is very simple. Note that $s \in [\rho x - A, \rho x + A]$ is not much
further than $A$ away from $x$, so $g'(s)$ is very close to $g'(x) = \frac{x}{2L^2} g(x)$. Since $\bar\rho \ge A^2/L^2$, we have
$g'(x) A^2 = \frac{x}{2L^2} g(x) A^2 \le \bar\rho x g(x)/2$. In fact the function $g(x)$ was chosen to satisfy this differential equation.
In what follows we make this intuition precise.

First note that $g'(s) = 0$ for $|s| > U$, and $g'(x) \le Z$ when $x \le L$, so we only need to consider
$L - 1 \le x \le U/\rho + 1$.

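We have $|g'(s)| \le \frac{s}{2L^2} g(s)$ for $|s| \le U/\rho + 1$. Thus,

$$\begin{aligned}
\max_{s \in [\rho x - A,\, \rho x + A]} |g'(s)| &\le \left( \max_{s \in [\rho x - A,\, \rho x + A]} s/x \right) \cdot \left( \max_{s \in [\rho x - A,\, \rho x + A]} g(s)/g(x) \right) \cdot |g'(x)| \\
&\le (1 + 1/(L - 1)) \max_{s \in [\rho x - A,\, \rho x + A]} e^{(2x(s - x) + (s - x)^2)/2L^2} \, \frac{x}{2L^2} g(x) \qquad (13) \\
&\le (1 + 1/(L - 1)) \left( 1 + (2x(\rho + 1) + (\rho + 1)^2)/2L^2 \right) \frac{x}{2L^2} g(x)
\end{aligned}$$

since $x \le U/\rho + 1$.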

Finally, when $x \le U/\rho + 1$,

$$\max_{s' \in [\rho x - A,\, \rho x + A]} (s' - x)^2 \le (\bar\rho x + A)^2 \le (\bar\rho U/\rho + A)^2. \qquad (14)$$



Putting (13) and (14) together, we get

$$\max_{s' \in [\rho x - A,\, \rho x + A]} |g'(s')| (s' - x)^2 \le \bar\rho x g(x)$$

for $L - 1 \le x \le U/\rho + 1$. Also, we have

$$\max_{s' \in [\rho x - A,\, \rho x + A]} |g'(s')| (s' - x)^2 \le Z$$

for $x \le L$, which gives the result.
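The first inequality in the distance bound (14) can be checked numerically: for $x \ge 0$ and $s \in [\rho x - A, \rho x + A]$, the squared distance $(s - x)^2$ is largest at the left endpoint, where it equals $(\bar\rho x + A)^2$ with $\bar\rho = 1 - \rho$. The sketch below is a sanity check under assumed parameter values; the helper names `max_sq_distance` and `distance_bound` are hypothetical.

```python
def max_sq_distance(x, rho, A, grid=1000):
    """Maximum of (s - x)^2 over a uniform grid on [rho*x - A, rho*x + A]."""
    lo, hi = rho * x - A, rho * x + A
    return max((lo + (hi - lo) * k / grid - x) ** 2 for k in range(grid + 1))

def distance_bound(x, rho, A):
    """The bound ((1 - rho) * x + A)^2 from the first inequality of (14)."""
    return ((1.0 - rho) * x + A) ** 2

# Assumed test values; x >= 0 matches the case treated in the proof.
checks = [
    (x, rho, A)
    for x in (0.0, 1.0, 5.0, 50.0)
    for rho in (0.5, 0.9, 0.99)
    for A in (0.1, 1.0)
]
holds = all(
    max_sq_distance(x, rho, A) <= distance_bound(x, rho, A) + 1e-9
    for x, rho, A in checks
)
```

The small tolerance only absorbs floating-point rounding, since the bound is attained with equality at the left endpoint.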